Abstract



Consider an input text string $T = T[1,N]$ drawn from an unbounded alphabet, so that text
positions can be accessed only through comparisons. We study partial computation in suffix-based
problems for Data Compression and Text Indexing, such as

• retrieving any segment of $K \le N$ consecutive symbols from the Burrows-Wheeler transform
of $T$, which is at the heart of the bzip2 family of text compressors, and

• retrieving any chunk of $K \le N$ consecutive entries of the Suffix Array or the Suffix Tree,
two popular Text Indexing data structures for $T$.

Prior literature would take $O(N \log N)$ comparisons (and time) to solve these problems by
solving the total problem of building the entire Burrows-Wheeler transform or Text Index for $T$,
and then post-processing the result to single out the wanted portion. The technical challenge is
that the suffixes of interest have total size potentially $O(KN)$ and overlap in intricate ways: we have
to use structural properties of these overlaps to avoid rescanning them repeatedly.

We introduce a novel adaptive approach to the partial computational problems above, and solve
both partial problems in

$$O(K \log K + N)$$

comparisons and time, improving the best known running time of $O(N \log N)$ for $K = o(N)$.

These partial-computation problems are intimately related since they share a common
bottleneck: the suffix multi-selection problem, which is to output the suffixes of rank $r_1, r_2, \ldots, r_K$
under the lexicographic order, where $r_1 < r_2 < \cdots < r_K$ and $r_i \in [1, N]$. Special cases of this
problem are well known: $K = N$ is the suffix sorting problem that is the workhorse in Stringology
with hundreds of applications, and $K = 1$ is the recently studied suffix selection.

We show that suffix multi-selection can be solved in

$$O\Bigl(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N\Bigr)$$

time and comparisons, where $r_0 = 0$, $r_{K+1} = N + 1$, and $\Delta_j = r_{j+1} - r_j$ for $0 \le j \le K$.
This is asymptotically optimal, and also matches the bound in [7] for multi-selection on atomic
elements (not suffixes). Matching, for strings, the bounds known for atomic elements is a
long-running theme and challenge dating from the 1970s, which we achieve for the suffix multi-selection problem.
The partial suffix problems as well as the suffix multi-selection problem have many applications.




1 Introduction 



Consider an input text string $T = T[1,N]$ and the set $S$ of its suffixes $T_i = T[i,N]$ ($1 \le i \le N$)
under the lexicographic order, where $T[N]$ is an endmarker symbol \$ smaller than any other symbol
in $T$. The alphabet $\Sigma$ from which the symbols in $T$ are drawn is unbounded: as is standard in
Stringology, we assume that any two symbols in $\Sigma$ can only be compared and this takes $O(1)$ time.
Hence, comparing symbolwise any two suffixes in $S$ may require $O(N)$ time in the worst case.¹

We study partial computation problems in Data Compression and Text Indexing for T where 
we want to quickly get a sense of the lexicographic distribution of the text suffixes. 

Partial Data Compression. The Burrows-Wheeler transform $L$ (a.k.a. bwt) [3] of text string $T$
is at the heart of the bzip2 family of text compressors, and has many applications. The $r$th symbol
in $L$ is $T[j-1]$ if and only if $T_j$ is the $r$th suffix in the sorting (except the borderline case $j = 1$, for
which we take $T[N]$). There are now efficient methods that convert $T$ to $L$ and vice versa, taking
$O(N \log N)$ time for unbounded alphabets in the worst case.
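The relation between $L$ and the sorted suffixes can be checked on a small input. The following is a minimal sketch, assuming a naive sort of explicit suffixes (quadratic in the worst case and for illustration only, not one of the efficient methods mentioned above); the helper names `suffix_array` and `bwt` are hypothetical.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the suffixes of text, in sorted order."""
    n = len(text)
    # Naive approach: materialize every suffix and sort. Illustration only.
    return sorted(range(1, n + 1), key=lambda j: text[j - 1:])


def bwt(text):
    """L[r] = T[j-1] where T_j is the r-th smallest suffix (T[N] when j = 1)."""
    n = len(text)
    out = []
    for j in suffix_array(text):
        prev = n if j == 1 else j - 1      # 1-based position of the preceding symbol
        out.append(text[prev - 1])
    return "".join(out)


T = "mississippi$"                          # '$' is smaller than any letter in ASCII
print(suffix_array(T))                      # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt(T))                               # ipssm$pissii
```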

A partial compression problem is to consider a range $L' = L[i..i+K-1]$ of $K$ consecutive symbols
in $L$. Can we compute $L'$ efficiently? More precisely, can we compute $L'$ without computing the
entire $L$? This is an interesting building block for partial estimation of data compression ratio.

There is prior work that studies partial compression problems where a range $T[i, i+K-1]$ needs
to be compressed (by Lempel-Ziv, Burrows-Wheeler, or one of the other compression methods).
This can be accomplished in $O(K \log K + N)$ time using off-the-shelf tools. Instead, what is
interesting in our question above is that we seek a range in the compressed string $L$, and computing
$L[i..i+K-1]$ amounts to sorting a set of irregular, arbitrarily scattered suffixes $T_{j_1}, T_{j_2}, \ldots, T_{j_K}$
for which we do not know the positions $j_1, j_2, \ldots, j_K$ a priori.

Partial Text Indexing. Several text indexes, such as suffix arrays [21] and suffix trees [24, 29],
are based on the lexicographic order of the suffixes in the text $T$ and the longest common prefix (lcp)
information among them. These can be computed in $O(N \log N)$ time using well-known algorithms
that exploit properties of suffixes,² while the rest of the index can be easily built in $O(N)$ time.

We define a $K$-partial text index as a range $[i, i+K-1]$ of the index (consecutive entries of the
suffix array or leaves from the suffix tree): this corresponds to a sorted set of irregularly scattered
suffixes $T_{j_1}, T_{j_2}, \ldots, T_{j_K}$ for which we do not know the positions $j_1, j_2, \ldots, j_K$ a priori, together
with the lengths of their longest common prefixes (lcp). The technical challenge here is similar to
partial compression above, but additionally, we need to compute the lcp information.

We refer the reader to Sections 2.1 and 2.2 for a more detailed discussion.

Basic questions. Can partial suffix-based computations, like compression and indexing above,
be solved more efficiently than solving the total problems? Besides the inherent interest in such
problems and their structure, which lets us parameterize their complexity in terms of $K \in [1, N]$,
the main applied interest is that these partial problems give us a way to look at spots of a long
string and get a sense of the complexity of the data, be it for compressibility or performance of a
full-text index.

The central technical challenge is the following. Given $K$ symbols, sorting them is trivial.
However, if we have to sort $K$ arbitrary suffixes, they have total size $O(KN)$ in the worst case, and
we cannot afford to compare them symbolwise. In the worst case, it is better to perform a total
sorting in $O(N \log N)$ time when $K = \Omega(\log N)$, as the suffixes overlap in arbitrary ways and we
have to avoid rescanning the symbols repeatedly. Can we improve on this $O(N \log N)$ bound?

¹ The number of comparisons is asymptotically the same as the time for all the algorithms discussed throughout this
paper, and hence we will use them interchangeably.

² It has had an $O(N)$ time solution since the 1970s for constant-size alphabets $\Sigma$ [24, 29] and, more recently, for (bounded
universe) integer alphabets $\Sigma$ [8]; otherwise, it uses $O(N \log |\Sigma|) = O(N \log N)$ comparisons in unbounded alphabets $\Sigma$.
Our results. We introduce a new adaptive approach to suffix sorting and order statistics.



Theorem 1 Given a text $T$ of length $N$, partial compression and partial text indexing problems
can be solved in $O(K \log K + N)$ time.

Hence for $K = o(N)$, partial compression and text indexing problems can be solved asymptotically
faster than their total counterparts. In particular, for $K = O(N/\log N)$, these partial
problems can be solved in $O(N)$ time, which is quite rare in the comparison model.

We can also provide a bound for an arbitrary choice of $K$ ranks $r_1, r_2, \ldots, r_K$ in the suffix order.

Theorem 2 Given a text $T$ of length $N$, partial compression and indexing can be solved using

$$O\Bigl(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N\Bigr) \qquad (1)$$

time and comparisons, where $r_0 = 0$, $r_{K+1} = N + 1$, and $\Delta_j = r_{j+1} - r_j$ for $0 \le j \le K$; here,
$1 \le r_1 < r_2 < \cdots < r_K \le N$ are the ranks of the suffixes involved in the output.

The algorithms behind Theorem 1 use an intermediate stage before applying the algorithms
behind the more general Theorem 2, as otherwise the cost would be $O(K \log N + N)$ when choosing
consecutive ranks $r_1, \ldots, r_K$ from the given range of values (i.e. $\Delta_j = 1$ for $1 \le j < K$).
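To get a feel for how the bound in (1) depends on the chosen ranks, it can be evaluated numerically. The sketch below is only an illustration under stated assumptions: the function names `gaps` and `multiselect_bound` are hypothetical, logarithms are taken in base 2, and the constants hidden by the $O$ notation are ignored.

```python
import math


def gaps(n, ranks):
    """Return [Delta_0, ..., Delta_K] with r_0 = 0 and r_{K+1} = n + 1."""
    ext = [0] + sorted(ranks) + [n + 1]
    return [ext[j + 1] - ext[j] for j in range(len(ext) - 1)]


def multiselect_bound(n, ranks):
    """Evaluate N log N - sum_j Delta_j log Delta_j + N (base-2 logs, no constants)."""
    return n * math.log2(n) - sum(d * math.log2(d) for d in gaps(n, ranks)) + n


N = 1 << 20
K = 1 << 10
consecutive = list(range(N // 2, N // 2 + K))       # K consecutive ranks
spread = list(range(N // K, N + 1, N // K))         # K equally spaced ranks
print(multiselect_bound(N, consecutive))            # roughly of order K log N + N
print(multiselect_bound(N, spread))                 # roughly of order N log K
print(N * math.log2(N) + N)                         # full sorting, for comparison
```

Every $\Delta_j$ is at least 1 because the ranks are distinct values in $[1, N]$, so the logarithms above are always defined.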

The above partial-computation problems share a common bottleneck: the suffix multi-selection
problem, which is to output the suffixes of rank $r_1, r_2, \ldots, r_K$ under the lexicographic order, where
$r_1 < r_2 < \cdots < r_K$ and $r_i \in [1, N]$. Special cases of this problem are well known: $K = N$ is the
standard suffix sorting problem, and $K = 1$ is the recently studied suffix selection, for which $O(N)$-time
comparison-based solutions are now known [11, 12]. We refer the reader to Section 2.3 for a
more detailed discussion.

Theorem 3 Given a text $T$ of length $N$, the $K$ text suffixes with ranks $r_1 < r_2 < \cdots < r_K$ (and
the lcp's between consecutive suffixes) can be found within the bound stated in equation (1). This
is optimal.

Related work. A long-running theme in string matching has been matching, for the suffixes of a
string, what one can do for atomic elements. The earliest suffix tree algorithms of the 1970s [24, 29]
were interesting because they sort the suffixes of a string over a constant-sized alphabet in $O(N)$ time,
matching the bucket sorting bound for $N$ elements. However, it took much longer to match the
$O(N)$ time of radix sorting for strings over an integer alphabet, in the 1990s [8], and in other computing
models [9, 16]. For selection, the classic $O(N)$ time bound for atomic elements from the 1970s was
matched only recently for string suffix selection [12, 11].

Similarly, as we show here, it has been a technical challenge to match the multi-selection bound
of atomic elements for string suffixes. Multi-selection for suffixes is interesting not only for the reasons
the multi-selection problem in general is interesting, i.e., for statistical analysis of string suffixes, but
also because it emerges naturally as the computational bottleneck of several problems, like the
partial compression and indexing problems described above, which have no natural counterpart in
the study of atomic elements.

For the sake of completeness, we recall that multi-selection of atomic elements includes basic
problems such as sorting ($K = N$) and selection ($K = 1$) as special cases [18]. Selection algorithms
that work in expected linear time [10, 13] or worst-case linear time [2] are now in textbooks.
Multi-selection can also model intermediate problems between sorting and selection: for example,
setting equally spaced $r_i$'s, it corresponds to the quantile problem in Statistics. The asymptotically
optimal number of comparisons (and running time) is that in equation (1), as proved in [7].³ Our
suffix multi-selection algorithm is optimal, matching the lower bound even for atomic elements!

Table 1: BWT $L$ for the text $T = \mathtt{mississippi\$}$ and its relation with the sorted suffixes.

Paper organization. We first give more details on partial data compression, partial text indexing,
and suffix multi-selection in Section 2, so as to relate the former two problems to the latter one.
Then, the rest of the paper is devoted to suffix multi-selection (Theorem 3) with the following
organization. We give the main ideas and introduce the main concepts of subproblems and
agglomerates, their data structures and algorithms, in Section 3. The top-level description of our
multi-selection of suffixes is given in Section 4, and then all the implementation details are given
in Section 5. Finally, we focus on the correctness and the analysis of the costs in Sections 6 and 7.

2 Partial Data Compression, Partial Text Indexing, and Suffix 
Multi-Selection 

We discuss here some of the new features that make the multi-selection problem on suffixes
interesting and challenging, and focus on its applications to Data Compression and Text Indexing.
To evaluate the benefits of our findings, we present some examples illustrating which tasks can be 
done optimally with known techniques and which cannot. At the same time we show that using 
our novel techniques, we can now perform optimally the latter tasks, which only had suboptimal 
algorithms so far. 

2.1 Partial Data Compression 

The Burrows-Wheeler transform (bwt) [3] is at the heart of the bzip2 family of text compressors.
Consider all the $N$ circular shifts of the text $T = \mathtt{mississippi\$}$ as shown in the first column
(original) of Table 1. Perform a lexicographic sorting of these shifts, as shown in the second column
(sorted): if we single out the last symbol from each of the circular shifts in this order, we obtain a
sequence $L$ of $N$ symbols that is called the bwt of $T$. Interestingly, not only can we recover $T$ from
$L$ alone, but typically $L$ is more compressible than $T$ itself using 0th-order compressors (e.g. [22]).
Its relation with suffix sorting is well known: the $r$th symbol in $L$ is $T[j-1]$ if and only if $T_j$ is
the $r$th suffix in the sorting (except the borderline case $j = 1$, for which we take $T[N]$), as shown
in the third column (suffixes).

³ The bound in (1) can be refined by studying the actual constant factors hidden in the notation [15]. Several
papers have studied other variations [1, 19, 20, 23, 26, 28].

Data compression ratio can be partially estimated by choosing a suitable sample $L'$ of $L$ for
statistical purposes. There are several ways to make this choice; some are easy and some others
are not, as we show next. It is easy to build $L'$ if we take every $q$th suffix in $T$. For example,
$q = 3$ gives $L' = \mathtt{\$pss}$ since we pick $T_1, T_4, T_7, T_{10}$ and then perform their lexicographic sorting,
namely, $T_1 < T_{10} < T_7 < T_4$. The latter is a simple variant of the standard suffix sorting and
takes $O(N \log(N/q))$ time: the text $T$ is conceptually partitioned as a sequence of $N/q$ macro-symbols,
where each macro-symbol is a segment of $q$ actual symbols in $T$ (could be fewer in the
last one). Suffix sorting requires $O((N/q) \log(N/q))$ macro-comparisons, each involving $q$ symbolwise
comparisons, thus giving the above bound.

What if $L'$ is chosen by taking every $q$th symbol directly in $L$? Contrary to the previous
situation, here we guarantee a uniform sampling from $L$. For example, $q = 3$ gives $L' = \mathtt{isps}$,
which corresponds to selecting the suffixes $T_{12} < T_5 < T_{10} < T_4$, as shown in boldface in Table 1.
Here comes a crucial observation: even though we sample from $L$ with regularity, the starting
positions of the chosen suffixes from the input text $T$ form an irregular pattern and are difficult
to predict without suffix sorting. We are not aware of any better approach other than performing
a full execution of suffix sorting and then post-processing the result to single out every
$q$th sorted item. This yields a suboptimal cost of $O(N \log N)$, which should be compared to the
$O(N \log(N/q)) = O(N \log K)$ cost in equation (1) by setting $\Delta_j = q$ for $j < K$ and $\Delta_K \le q + 1$.
In general, specifying ranks $r_1, r_2, \ldots, r_K$ gives the sample made up of the $r_1$th, $r_2$th, ..., $r_K$th
symbols of $L$: using the algorithm giving the cost in (1), we can look at specific parts of the
compressed string, without paying the full suffix sorting cost. We refer the reader to Theorem 2.
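The two sampling strategies can be contrasted on the running example. This is a minimal sketch with hypothetical helper names; the second function deliberately uses the full-sort baseline described above (sort everything, then post-process), not the multi-selection algorithm of this paper.

```python
def preceding_symbol(text, j):
    """Symbol T[j-1] for the 1-based suffix position j (T[N] when j = 1)."""
    return text[-1] if j == 1 else text[j - 2]


def sample_every_qth_suffix(text, q):
    """Pick T_1, T_{1+q}, T_{1+2q}, ... and sort only those suffixes."""
    chosen = sorted(range(1, len(text) + 1, q), key=lambda j: text[j - 1:])
    return chosen, "".join(preceding_symbol(text, j) for j in chosen)


def sample_every_qth_rank(text, q):
    """Pick ranks 1, 1+q, 1+2q, ... of L via a full suffix sort (baseline only)."""
    sa = sorted(range(1, len(text) + 1), key=lambda j: text[j - 1:])
    chosen = sa[::q]
    return chosen, "".join(preceding_symbol(text, j) for j in chosen)


T = "mississippi$"
print(sample_every_qth_suffix(T, 3))    # ([1, 10, 7, 4], '$pss'): regular positions
print(sample_every_qth_rank(T, 3))      # ([12, 5, 10, 4], 'isps'): irregular positions
```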

An intriguing situation arises when we consider just a segment of $K$ consecutive symbols. Let us
first consider a text segment $T[a, b]$, where $K = b - a + 1$ and $1 \le a \le b \le N$. It takes $O(K \log K)$
time to perform a suffix sorting and compute the bwt of that segment alone; or $O(K \log K + N)$
time to sort the consecutive suffixes $T_a, T_{a+1}, \ldots, T_b$ whose starting positions lie in that segment,
and then find their induced symbols inside $L$. Once again, these are simple variations of suffix
sorting.

What if we want to compute only a segment $L[a, b]$ of $K = b - a + 1$ consecutive symbols
instead of the whole $L$? For example, $L[3,5] = \mathtt{ssm}$ corresponds to suffixes $T_8 < T_5 < T_2$ in
Table 1. In general, the starting positions of these suffixes form an irregular pattern and, as far as
we know, the best that we can do is performing a full suffix sorting in $O(N \log N)$ time. Instead,
setting $r_1 = a, r_2 = a + 1, \ldots, r_K = b$, we obtain that $\Delta_0 = a$, $\Delta_K = N + 1 - b$, and $\Delta_j = 1$ for
$1 \le j < K$. Hence, the cost implied by equation (1) is $O(K \log N + N)$, and the computed longest
common prefixes in Theorem 3 will also provide the contexts for estimating the empirical entropy
of $L$ restricted to the segment $L[a, b]$. Even better, we can obtain $O(K \log K + N)$ using the same
algorithm behind equation (1) and an analysis focused on this special case (i.e. the wanted ranks
form an interval of consecutive values, see Lemma [TH]). When $K = o(N)$, this compares favorably
with the suboptimal $O(N \log N)$ cost of building $L$ explicitly. We refer the reader to Theorem 1.
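The segment example and its gap values can be reproduced as follows. This is a sketch of the problem statement only: `bwt_segment` is a hypothetical name, and it obtains the segment through a full suffix sort, which is exactly the suboptimal baseline the text wants to avoid.

```python
def bwt_segment(text, a, b):
    """Return (positions, symbols) for L[a..b], 1-based and inclusive.

    Baseline only: sorts all suffixes and then slices out the wanted ranks.
    """
    sa = sorted(range(1, len(text) + 1), key=lambda j: text[j - 1:])
    chosen = sa[a - 1:b]
    symbols = "".join(text[-1] if j == 1 else text[j - 2] for j in chosen)
    return chosen, symbols


def segment_gaps(n, a, b):
    """Gaps Delta_0, ..., Delta_K for the consecutive ranks a, a+1, ..., b."""
    k = b - a + 1
    return [a] + [1] * (k - 1) + [n + 1 - b]


T = "mississippi$"
print(bwt_segment(T, 3, 5))        # ([8, 5, 2], 'ssm')
print(segment_gaps(len(T), 3, 5))  # [3, 1, 1, 8]
```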

2.2 Partial Text Indexing 

Several text indexes, such as suffix arrays [21] and suffix trees [24, 29], are based on the lexicographic
order of the suffixes in the text $T$ and the longest common prefix (lcp) information among them.
Ultimately, the suffix sorting and the lcp information constitute the kernel upon which the rest of
the index can be easily built in $O(N)$ time. An instance of suffix tree for the text $T = \mathtt{mississippi\$}$
is shown in Figure 1 (left), and consists of a compacted trie storing all the suffixes of $T$.

Based on the above observation, we define a $K$-partial text index as a subset of $K$ suffixes plus
their lcp information. Having this, we can build the suffix array or the suffix tree restricted to these
$K$ suffixes in $O(K)$ time. Hence, the problem of building a $K$-partial text index amounts to
sorting these $K$ suffixes and finding their lcp information. We discuss two ways of
choosing these $K$ suffixes.

Figure 1: A suffix tree and two ways to sample it. Only the first symbol is shown on each labelled
edge.

One possibility is sampling every $q$th suffix in $T$, as shown in Figure 1 (center) with $q = 3$.
This is the sampled suffix tree introduced in [17], and its construction is a simple variant of the
standard one and takes $O(N \log(N/q)) = O(N \log K)$ time: as previously mentioned, the text $T$ is
conceptually partitioned as a sequence of $K = N/q$ macro-symbols.

Another possibility is sampling every $q$th leaf directly from the suffix tree, as shown in
Figure 1 (right) with $q = 3$. However, with the current techniques we only have a suboptimal solution:
first build the whole suffix tree in $O(N \log N)$ time and then perform a post-processing to select
the wanted leaves and their ancestors (removing the possible unary nodes thus created). Using the
algorithm behind equation (1) and the related longest common prefix information as a byproduct,
we are able to build this $K$-partial index in $O(N \log(N/q)) = O(N \log K)$ time by fixing $\Delta_j = q$
for $j < K$ and $\Delta_K \le q + 1$. Contrary to the sampled suffix tree above, here the starting positions
of the chosen suffixes form an irregular pattern even though we sample from the suffix tree with
regularity.

In general, given a text $T$ and ranks $r_1, r_2, \ldots, r_K$, we want to build the $K$-partial text index
for the suffixes having those ranks. For example, when employing a suffix tree, the $K$-partial text
index gives the subtrie made up of the $r_1$th, $r_2$th, ..., $r_K$th leaves of the suffix tree for $T$. But we
do not want to build that full suffix tree explicitly in $O(N \log N)$ time. Using the algorithm behind
equation (1), we can attain this goal. We refer the reader to Theorem 2.

A somewhat surprising situation arises when considering just a segment of $K$ consecutive symbols.
Consider first the $K$-partial index built on the text segment $T[a,b]$, where $K = b - a + 1$
and $1 \le a \le b \le N$. It takes $O(K \log K)$ time to build the suffix array or the suffix tree for that
segment alone, or $O(K \log K + N)$ time to build them for the consecutive suffixes $T_a, T_{a+1}, \ldots, T_b$
whose starting positions lie in that segment. Once again, these are simple variations of known
algorithms.

This is not the case when computing just a segment of $K$ consecutive entries in the suffix array,
or just a chunk of $K$ consecutive leaves from the suffix tree. Clearly, we do not want the suboptimal
solution that builds the entire suffix array or suffix tree in $O(N \log N)$ time. Ours is a special case of
$K$-partial text index, since the wanted suffixes have consecutive ranks $r_1 = a, r_2 = a + 1, \ldots, r_K = b$.
The $O(K \log N + N)$ cost implied by equation (1) can be refined (using Lemma [TH]) so that we obtain
$O(K \log K + N)$, comparing favorably with the suboptimal cost of building the suffix array/tree
entirely. In general, the task of computing only a chunk of $K$ consecutive entries from the suffix
array or the suffix tree falls within the following result. We refer the reader to Theorem 1.
2.3 Suffix Multi-Selection Problem



Given a set $S$ of $N$ elements from a total order $<$, the rank of $x \in S$ is $r \in [1, N]$ if $x$ is the
$r$th smallest in $S$, namely, $r = 1 + |\{y \in S : y < x\}|$. For integers $r_1 < r_2 < \cdots < r_K$, where each
$r_i \in [1, N]$, the multi-selection problem is to select the elements of rank $r_1, r_2, \ldots, r_K$ from $S$.

Multi-selection includes basic problems such as sorting and selection as special cases. When
$K = N$, it finds the ranks for all the elements, making it straightforward to arrange them in
sorted order [18]. When $K = 1$, it corresponds to the standard selection problem: given an integer
$r \in [1, N]$, the goal is to return the element of rank $r$ in $S$. Selection algorithms that work in
expected linear time [10, 13] or worst-case linear time [2] are now in textbooks. It can also model
intermediate problems between sorting and selection: for example, setting equally spaced $r_i$'s, it
corresponds to the quantile problem in Statistics. Multi-selection can thus arise in applications for
partitioning the input, say for a recursive approach.
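For atomic elements, multi-selection can be sketched with a simple divide-and-conquer scheme: select the element of the middle wanted rank, partition around it, and recurse on both sides with the remaining ranks. The code below is an illustrative assumption, not the algorithm of [7]; it requires distinct elements and uses randomized quickselect as the selection routine, and the names `select` and `multiselect` are hypothetical.

```python
import random


def select(items, k):
    """Return the k-th smallest (1-based) of distinct items by randomized quickselect."""
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    if k <= len(smaller):
        return select(smaller, k)
    if k == len(smaller) + 1:
        return pivot
    return select(larger, k - len(smaller) - 1)


def multiselect(items, ranks, offset=0):
    """Return {rank: element} for the sorted, distinct global ranks in `ranks`.

    `offset` is the number of elements known to be smaller than all of `items`.
    """
    if not ranks:
        return {}
    m = len(ranks) // 2
    r = ranks[m]                                  # middle wanted rank
    pivot = select(items, r - offset)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    found = {r: pivot}
    found.update(multiselect(smaller, ranks[:m], offset))
    found.update(multiselect(larger, ranks[m + 1:], r))   # r elements precede `larger`
    return found


data = [31, 4, 15, 92, 65, 35, 89, 79, 32, 38]
print(multiselect(data, [2, 5, 9]))               # elements 15, 35 and 89
```

With $K = N$ the result lists every element with its rank (sorting), and with $K = 1$ it reduces to plain selection.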

To the best of our knowledge the first algorithm for multi-selection was given in [7], thus
establishing that the asymptotically optimal number of comparisons (and running time) is that in
equation (1), where $r_0 = 0$ and $r_{K+1} = N + 1$ and $\Delta_j = r_{j+1} - r_j$ for $0 \le j \le K$. The formula
in (1) can be intuitively read as follows. Find the $\Delta_0$ smallest elements, then the $\Delta_1$ next smallest
elements, and so on, up to the last $\Delta_K$ ones. The resulting arrangement is almost sorted, and can
be fully sorted by ordering each individual group of $\Delta_j$ elements independently in $\Theta(\Delta_j \log \Delta_j)$
time. Hence, take the total sorting cost of $\Theta(N \log N)$, subtract the cost of sorting each group, i.e.
$\Theta(\sum_{j=0}^{K} \Delta_j \log \Delta_j)$, and add $\Theta(N)$ to read all the elements as a baseline. Note that rewriting (1)
as $\Theta(N \sum_{j=0}^{K} (\Delta_j/N) \log(N/\Delta_j) + N)$, where $\sum_{j=0}^{K} \Delta_j = N + 1$, we can reformulate the bound
in (1) as $\Theta(N(H_0 + 1))$, where $H_0 = -\sum_{j=0}^{K} p_j \log_2 p_j$ is the empirical 0th-order entropy and
$p_j = \Delta_j/N$ is the empirical probability of having the $j$th group of size $\Delta_j$.
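The rewriting step can be spelled out as a short derivation. It is a sketch that uses only the identity $\sum_{j=0}^{K} \Delta_j = N + 1$ stated above; the leftover $\log N$ term is absorbed by the additive $\Theta(N)$.

```latex
\begin{align*}
N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j
  &= \Bigl(\sum_{j=0}^{K} \Delta_j\Bigr) \log N - \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j
     && \text{since } \textstyle\sum_{j} \Delta_j = N + 1 \\
  &= \sum_{j=0}^{K} \Delta_j \log \frac{N}{\Delta_j} - \log N \\
  &= N \sum_{j=0}^{K} \frac{\Delta_j}{N} \log \frac{N}{\Delta_j} - \log N
   = N H_0 - \log N
     && \text{with } p_j = \Delta_j / N .
\end{align*}
```

Adding the $\Theta(N)$ baseline then gives $\Theta(N(H_0 + 1))$.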

The asymptotic optimality of (1) can be further refined by studying the actual constant
factors hidden in the $O$ notation. Any comparison-based algorithm must perform at least
$B = N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j - O(N)$ comparisons to solve multi-selection, and the algorithm in [15]
is nearly optimal in this sense. It attains $B + O(N)$ expected comparisons and $B + o(B) + O(N)$
comparisons in the worst case, taking $O(B + N)$ running time. For the interested reader, other
papers have studied further features in [1, 19, 20, 23, 26, 28].

We focus on the asymptotic optimality of the bound in (1) and call optimal an algorithm
that asymptotically meets that bound, namely, $\Theta(B + N)$ in the worst case. Unless specified, the
running time is always proportional to the number of pairwise comparisons between elements.

In this paper, we study the analog of the multi-selection problem in Stringology. Consider an
input text string $T = T[1,N]$ and the set $S$ of its suffixes $T_i = T[i,N]$ ($1 \le i \le N$) under the
lexicographic order. Let $T[N]$ be an endmarker symbol, denoted by \$, which is smaller than any
other symbol in $T$. The alphabet $\Sigma$ from which the symbols in $T$ are drawn is unbounded and the
comparison model is adopted: any two symbols in $\Sigma$ can only be compared and this takes constant
time. Hence, comparing symbolwise any two suffixes in $S$ may require $O(N)$ time in the worst
case. Given ranks $r_1 < r_2 < \cdots < r_K$, the suffix multi-selection problem is to output the suffixes
of rank $r_1, r_2, \ldots, r_K$ in $S$. Our main contribution is that we extend the asymptotically optimal
bound in (1) to the suffixes in $S$. We refer the reader to Theorem 3.

Recall that the length of the longest common prefix of any two suffixes $T_i$ and $T_j$ is defined as
the smallest $\ell \ge 0$ such that $T[i + \ell] \ne T[j + \ell]$. We can also obtain this information as a byproduct, which
is useful for the problems mentioned in Sections 2.1 and 2.2.
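The definition translates directly into a symbol-by-symbol scan. The following is a minimal sketch with a hypothetical function name `lcp`, using 1-based suffix positions as in the text; it is the naive computation, not the byproduct mechanism of the algorithm.

```python
def lcp(text, i, j):
    """Length of the longest common prefix of suffixes T_i and T_j (1-based)."""
    n = len(text)
    length = 0
    # Stop at the first mismatch or when either suffix runs out of symbols.
    while (i + length <= n and j + length <= n
           and text[i + length - 1] == text[j + length - 1]):
        length += 1
    return length


T = "mississippi$"
print(lcp(T, 2, 5))    # 4: "ississippi$" and "issippi$" share "issi"
print(lcp(T, 3, 6))    # 3: "ssissippi$" and "ssippi$" share "ssi"
```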
3 Concepts, Definitions, and Main Ideas



For the given string $T = T[1,N]$, let $T[N]$ be an endmarker symbol, denoted by \$, that is smaller
than any symbol of the alphabet and does not appear elsewhere in $T$. Let $T_i = T[i,N]$ be the
$i$-th suffix of $T$, and $S = \{T_i \mid 1 \le i \le N\}$ be the set of all these suffixes. Given the set $\mathcal{R}$ of ranks
$r_1 < r_2 < \cdots < r_K$, we want to select the suffixes $T_{i_1} < T_{i_2} < \cdots < T_{i_K}$ such that $T_{i_j}$ has rank $r_j$
in $S$, for $1 \le j \le K$. Since we already know that $T_N$ is the smallest suffix of $T$, we can assume wlog
that $r_1 > 1$. Also, we use $<$ and $\le$ to denote string comparison according to the lexicographical
order. For any two sets $X, Y$, notation $X < Y$ indicates that $x < y$ for any pair $x \in X, y \in Y$.

Our main goal is to prove Theorem 3, as this is the major obstacle when proving Theorems 1-2.
The tricks of the trade all rely on the following "golden rule" on the lcp information for the suffixes:
For any integer $d > 0$, if $\mathrm{lcp}(T_i, T_j) \ge \ell$ for some integer $\ell \ge d$, then $\mathrm{lcp}(T_{i+d}, T_{j+d}) \ge \ell - d$. Thus,
if $T_i$ and $T_j$ have been compared, the direct comparison of the first $\ell - d$ symbols of $T_{i+d}$ and $T_{j+d}$
can be avoided. Conversely, if $T_{i+d}$ and $T_{j+d}$ have been compared, the comparison of all but the
first $d$ symbols in $T_i$ and $T_j$ can be avoided.
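The golden rule can be checked exhaustively on a small text. This is a brute-force sanity check under the assumption that lcp values are computed naively; `lcp` and `golden_rule_holds` are hypothetical helper names and play no role in the actual algorithm.

```python
def lcp(text, i, j):
    """Length of the longest common prefix of suffixes T_i and T_j (1-based)."""
    n = len(text)
    length = 0
    while (i + length <= n and j + length <= n
           and text[i + length - 1] == text[j + length - 1]):
        length += 1
    return length


def golden_rule_holds(text):
    """Check lcp(T_{i+d}, T_{j+d}) >= lcp(T_i, T_j) - d for all i != j and 0 < d <= lcp."""
    n = len(text)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i == j:
                continue
            shared = lcp(text, i, j)
            for d in range(1, shared + 1):
                if lcp(text, i + d, j + d) < shared - d:
                    return False
    return True


print(golden_rule_holds("mississippi$"))    # True
```

For instance, $T_2$ and $T_5$ share 4 symbols, so $T_3$ and $T_6$ are known to share at least 3 without being rescanned.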

Unfortunately, we cannot always rely on the golden rule here. When comparing $T_i$ and $T_j$, we
do not yet know whether or not both $T_{i+d}$ and $T_{j+d}$ will have to be compared (directly or indirectly
by transitivity) for a choice of $d > 0$, or vice versa: simply put, we cannot predict at each stage of
the computation whether the comparison between $T_{i+d}$ and $T_{j+d}$ will occur or not in the future.

3.1 Subproblems 

At the beginning, all the suffixes form a single problem $S = \{T_1, T_2, \ldots, T_N\}$. At the end, we want to
obtain a partition of $S$ into subproblems, namely, $S = S_1 \cup \{T_{i_1}\} \cup S_2 \cup \{T_{i_2}\} \cup \cdots \cup S_K \cup \{T_{i_K}\} \cup S_{K+1}$,
such that $S_j$ contains all the suffixes in $S$ of rank $r$ with $r_{j-1} < r < r_j$ for $1 \le j \le K + 1$. Although
each $S_j$ is not internally sorted, still $S_1 < \{T_{i_1}\} < S_2 < \{T_{i_2}\} < \cdots < S_K < \{T_{i_K}\} < S_{K+1}$.

At an arbitrary stage of the computation, the suffixes in $S$ and the ranks in $\mathcal{R}$ are partitioned
amongst the subproblems. Let us call $\mathcal{P}_1 < \mathcal{P}_2 < \cdots < \mathcal{P}_z$ the subproblems in the current stage,
where $S = \mathcal{P}_1 \cup \mathcal{P}_2 \cup \cdots \cup \mathcal{P}_z$, and let $\mathcal{R}(\mathcal{P}_i)$ be the set of ranks associated with subproblem $\mathcal{P}_i$,
for $1 \le i \le z$. Namely, $r \in \mathcal{R}(\mathcal{P}_i)$ iff $\ell_i < r \le \ell_i + |\mathcal{P}_i|$, where $\ell_i = \sum_{j<i} |\mathcal{P}_j|$ is the number of
lexicographically smaller suffixes.

Status. A subproblem $\mathcal{P}_i$ is solved if $|\mathcal{P}_i| = |\mathcal{R}(\mathcal{P}_i)| = 1$, and unsolved if $|\mathcal{P}_i| > 1$ and $|\mathcal{R}(\mathcal{P}_i)| \ge 1$.
A subproblem $\mathcal{P}_i$ and its suffixes are exhausted if $|\mathcal{R}(\mathcal{P}_i)| = 0$. A subproblem $\mathcal{P}_i$ and its suffixes
are degenerate if each of these suffixes shares only an empty prefix with any of the wanted suffixes
$T_{i_1} < T_{i_2} < \cdots < T_{i_K}$. (Note that (i) a degenerate subproblem is also exhausted and (ii) $T_N$ is
always degenerate since we assume $r_1 > 1$.)
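The bookkeeping of labels, associated ranks, and status can be illustrated with a few lines of code. The sketch below assumes the subproblems are given only by their sizes in lexicographic order; `describe_subproblems` is a hypothetical helper, and the degenerate status is not modeled since it depends on the actual suffixes.

```python
def describe_subproblems(sizes, ranks):
    """Return (label, associated ranks, status) for each subproblem.

    sizes: |P_1|, ..., |P_z| in lexicographic order; ranks: the wanted ranks.
    """
    out = []
    label = 0                                   # l_i = sum of |P_j| over j < i
    for size in sizes:
        assigned = [r for r in ranks if label < r <= label + size]
        if not assigned:
            status = "exhausted"                # no rank falls in this subproblem
        elif size == 1:
            status = "solved"                   # one suffix, one rank
        else:
            status = "unsolved"                 # several suffixes, at least one rank
        out.append((label, assigned, status))
        label += size
    return out


# N = 12 suffixes and wanted ranks 3, 4, 5 (as for the segment L[3,5]).
print(describe_subproblems([12], [3, 4, 5]))            # the initial single problem
print(describe_subproblems([2, 1, 1, 1, 7], [3, 4, 5])) # the final partition
```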

A subproblem $\mathcal{P}_i$ is never merged with others and can only be refined into smaller subproblems.
A partition of $\mathcal{P}_i$ into $\mathcal{P}_{i_1}, \ldots, \mathcal{P}_{i_p}$ is called a refinement if for any two $\mathcal{P}_{i_j}, \mathcal{P}_{i_{j'}}$, either $\mathcal{P}_{i_j} < \mathcal{P}_{i_{j'}}$ or
$\mathcal{P}_{i_{j'}} < \mathcal{P}_{i_j}$; note that the refinement is a stronger notion than the partition, since it also takes into
account the lexicographic order among the suffixes.

During the computation a subproblem can be either active or inactive. If $|\mathcal{R}(\mathcal{P}_i)| \ge 1$ then $\mathcal{P}_i$ is
active. An active subproblem is subject to refinement, whereas this no longer holds
once it becomes inactive. If $\mathcal{P}_i$ is inactive then $|\mathcal{R}(\mathcal{P}_i)| = 0$ and thus it is exhausted. (The latter is a
necessary condition: not all the exhausted subproblems are inactive.) Once a subproblem becomes
inactive it will stay inactive until the end. Degenerate subproblems are inactive from the start. We
give a complete characterization in Section 3.2.

Integer labels and neighborhood. Ideally, we would like to maintain the integer label $\ell_i =
\sum_{j<i} |\mathcal{P}_j|$ for each subproblem $\mathcal{P}_i$ during the refining process. Using the integer labels, we can
define a total order relation $\prec$ and an equivalence relation $\sim$ on the suffixes of $T$. Consider any two
suffixes $T_{i'}, T_{j'}$ and their subproblems' labels $\ell_i, \ell_j$. We have that $T_{i'} \prec T_{j'}$ iff (1) $T[i'] < T[j']$ when
both $T_{i'}$ and $T_{j'}$ are degenerate or (2) $\ell_i < \ell_j$ otherwise. Similarly, $T_{i'} \sim T_{j'}$ iff (1) $T[i'] = T[j']$
when both $T_{i'}$ and $T_{j'}$ are degenerate or (2) $\ell_i = \ell_j$ otherwise. The total order $\prec$ is consistent with
the lexicographical order: if $T_{i'} < T_{j'}$ and $T_{i'} \not\sim T_{j'}$ then $T_{i'} \prec T_{j'}$. If the labels for two suffixes are
known then comparing them according to $\prec$ and $\sim$ takes $O(1)$ time.

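After an active subproblem $\mathcal{P}_i$ becomes inactive, it is possible that some of its suffixes are moved
to form other inactive subproblems $\mathcal{P}_{i_l}$'s, but we still need the value of $\ell_i$. For this reason, we need
to introduce a more general notion, that of neighborhood $\mathcal{N}_i = \mathcal{P}_i \cup \bigl(\bigcup_{l} \mathcal{P}_{i_l}\bigr)$, to preserve what was
once an individual active subproblem $\mathcal{P}_i$. Summing up: if $\mathcal{P}_i$ is active then $\mathcal{N}_i = \mathcal{P}_i$, else, $\mathcal{N}_i \supseteq \mathcal{P}_i$.
In any case, the reference label is $\ell_i = \sum_{\mathcal{N}_j < \mathcal{N}_i} |\mathcal{N}_j|$, the number of suffixes that are lexicographically smaller than those in $\mathcal{N}_i$.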

For the sake of description, each subproblem $\mathcal{P}_i$ or neighborhood $\mathcal{N}_i$, and by extension each
suffix in them, is conceptually associated with an $\alpha$-string $\alpha_i$. If $\mathcal{P}_i$ is non-degenerate then $T_j$ has
$\alpha_i$ as prefix iff $T_j \in \mathcal{N}_i$. If $\mathcal{P}_i$ is degenerate, a weaker property holds since $|\alpha_i| = 0$: for any $\mathcal{P}_j$
not in $\mathcal{N}_i$, either $\mathcal{P}_j < \mathcal{N}_i$ or $\mathcal{N}_i < \mathcal{P}_j$. Observing that only the integer labels are used to compare
suffixes according to $\prec$ and $\sim$, the $\alpha$-strings will not be maintained during the computation. For
the presentation in the paper, we will focus on subproblems rather than neighborhoods, keeping in mind
that the label of an inactive subproblem is that defined for its neighborhood.

Rationale. We refine the subproblems as follows. We pick an unsolved subproblem $\mathcal{P}_i$ and refine
it into smaller ones: we find the closest ranks $r_j, r_{j+1}$ for $\mathcal{P}_i$ partitioning it "evenly," namely,
$r_j \le \ell_i + |\mathcal{P}_i|/2 \le r_{j+1}$. Then, we select the suffixes $T_{i_j}, T_{i_{j+1}}$ with ranks $r_j, r_{j+1}$ and partition
$\mathcal{P}_i$ into three new subproblems according to $T_{i_j}$ and $T_{i_{j+1}}$. The new subproblem with the middle
suffixes is exhausted since it has no ranks associated (and will form $S_{j+1}$), while the other two
subproblems are still unsolved. The goal is to reach a situation in which each subproblem $\mathcal{P}_i$ is
either exhausted ($\mathcal{P}_i = S_j$ for some $j$) or solved ($\mathcal{P}_i = \{T_{i_j}\}$ for rank $r_j$ and some $j$): namely,
$S = \mathcal{P}_1 \cup \mathcal{P}_2 \cup \cdots \cup \mathcal{P}_z = S_1 \cup \{T_{i_1}\} \cup S_2 \cup \{T_{i_2}\} \cup \cdots \cup S_K \cup \{T_{i_K}\} \cup S_{K+1}$ is the resulting refinement
of the initial problem $S$. This scheme works if we suppose that $S$ contains independent strings.
Unfortunately, $S$ contains the suffixes of $T$ and so the rescanning cost is the main obstacle.
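One refinement step of this scheme can be sketched for independent elements, i.e. the setting in which the scheme works without further machinery. The code is an illustration under stated assumptions: `refine` is a hypothetical name, the elements are distinct, and a full sort stands in for linear-time selection and partitioning, so the cost does not reflect the intended one. The two selected elements are returned as their own singleton subproblems.

```python
import bisect


def refine(items, label, ranks):
    """Refine one unsolved subproblem; return new (label, elements) pairs in order.

    items: distinct elements of the subproblem; label: number of smaller elements
    overall; ranks: sorted wanted global ranks falling in (label, label + len(items)].
    """
    middle = label + len(items) / 2
    pos = bisect.bisect_right(ranks, middle)
    pivots = ranks[max(pos - 1, 0):pos + 1]     # r_j <= middle <= r_{j+1}, when both exist
    ordered = sorted(items)                     # stand-in for selection + partitioning
    parts, start = [], label
    for r in pivots:
        below = ordered[start - label:r - label - 1]
        if below:
            parts.append((start, below))        # elements strictly between two pivots
        parts.append((r - 1, [ordered[r - label - 1]]))   # the element of rank r
        start = r
    rest = ordered[start - label:]
    if rest:
        parts.append((start, rest))
    return parts


items = [70, 20, 110, 50, 90, 10, 120, 40, 80, 30, 100, 60]
for part in refine(items, 0, [2, 5, 8, 11]):
    print(part)
```

In this run the pivots have ranks 5 and 8; the middle group holds the elements of ranks 6 and 7 and carries no wanted rank, so it is exhausted, while the outer groups still contain the wanted ranks 2 and 11.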

3.2 Agglomerates of subproblems 

We group subproblems into agglomerates to model the interplay among suffixes that share the same
$\alpha$-string. We represent each agglomerate as a threaded dynamic tree where each node represents a
subproblem (a subset of suffixes), as illustrated in the example of Fig. 2.

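Dependency. For any two subproblems $\mathcal{P}_i, \mathcal{P}_j$, we say that $\mathcal{P}_i$ depends directly on $\mathcal{P}_j$ if the
following hold: (i) $\mathcal{P}_i = \mathcal{P}_j$ or (ii) $\mathcal{P}_i \neq \mathcal{P}_j$ and for each $T_x \in \mathcal{P}_i$ we have that $T_{x+1} \in \mathcal{P}_j$. We
extend this relation by transitivity, denoted $\sqsubseteq^+$, which is a partial order. When $\mathcal{P}_i \sqsubseteq^+ \mathcal{P}_j$ we say
that $\mathcal{P}_i$ depends on $\mathcal{P}_j$. We extend this terminology to single suffixes: a suffix $T_i \in \mathcal{P}_{i'}$ depends on
$T_j \in \mathcal{P}_{j'}$ (denoted by $T_i \sqsubseteq^+ T_j$) if $i \le j$, $\mathcal{P}_{i'} \sqsubseteq^+ \mathcal{P}_{j'}$ and $T_x \notin \mathcal{P}_{i'} \cup \mathcal{P}_{j'}$, for $i < x < j$.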
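
As an illustration only, the direct-dependency test can be sketched in Python. The sketch assumes a hypothetical dictionary `sub_of` that maps each text position x to an identifier of the subproblem containing the suffix T_x; the paper does not store subproblems this way, and the linear scan below ignores the efficiency concerns addressed later.

```python
def depends_directly(pi, pj, sub_of):
    """Return True if subproblem `pi` depends directly on subproblem `pj`.

    sub_of -- hypothetical dict: text position x -> id of the subproblem
              that contains the suffix T_x (positions are 1-based).
    """
    if pi == pj:                      # case (i): the relation is reflexive
        return True
    members = [x for x, s in sub_of.items() if s == pi]
    # case (ii): every suffix T_x of pi must have T_{x+1} inside pj
    return all(sub_of.get(x + 1) == pj for x in members)


sub_of = {1: "A", 2: "B", 3: "A", 4: "B", 5: "C"}
print(depends_directly("A", "B", sub_of))   # A = {T_1, T_3}, successors T_2, T_4 are in B
```

Here subproblem B does not depend directly on A, because the successor of T_4 lies in C.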

Partial order and tree representation. A set of subproblems is an agglomerate $\mathcal{A}$ if there
exists $\mathcal{P}_i \in \mathcal{A}$ such that $\mathcal{P}_j \sqsubseteq^+ \mathcal{P}_i$, for each $\mathcal{P}_j \in \mathcal{A}$ (i.e. $\mathcal{A}$ has a maximum according to the partial
order $\sqsubseteq^+$). The Hasse diagram according to $\sqsubseteq^+$ for an agglomerate $\mathcal{A}$ is a tree whose root is the
maximum subproblem and the leaves are the minimal subproblems of $\mathcal{A}$. Moreover, the children
of an internal node are subproblems directly depending on it (see Fig. 2). We denote by $|\mathcal{A}|$ the
number of subproblems in $\mathcal{A}$, and apply tree terminology to agglomerates. A suffix $T_x \in \mathcal{P}_i \in \mathcal{A}$
($x > 1$) is a contact suffix if $T_{x-1} \in \mathcal{P}_j$ and $\mathcal{P}_j \notin \mathcal{A}$. A subproblem $\mathcal{P}_i \in \mathcal{A}$ is a contact subproblem
if it contains at least one contact suffix. Each leaf of $\mathcal{A}$ is a contact node. Also, some internal nodes
may be contact ones. When we consider the contact nodes in preorder (see grey nodes in Fig. 2),
we call this the contact visiting order. Contact nodes are useful for the refinement.

Status. An agglomerate $\mathcal{A}$ is exhausted if all its subproblems are either exhausted or solved. An
unsolved subproblem of an agglomerate $\mathcal{A}$ is a leading subproblem if none of its ancestors in $\mathcal{A}$ are
unsolved. An agglomerate $\mathcal{A}$ is unsolved if the following holds:

Property 1 (Leading Subproblem Property) There exists a leading subproblem $\mathcal{P}_w$ in $\mathcal{A}$ s.t.
(i) each ancestor of $\mathcal{P}_w$ has only one child, and (ii) none of the ancestors of $\mathcal{P}_w$ is a contact node.

Note that Property 1(i) implies that an unsolved agglomerate has only one leading subproblem.
At any time, a subproblem $\mathcal{P}_i \in \mathcal{A}$ is active iff $\mathcal{A}$ is unsolved and either (i) $\mathcal{P}_i$ is unsolved or (ii) $\mathcal{P}_i$
is not solved and is a contact node of $\mathcal{A}$ (in this case $\mathcal{P}_i$ can be exhausted but still active). For any
agglomerate $\mathcal{A}$ and two active subproblems $\mathcal{P}_i, \mathcal{P}_j \in \mathcal{A}$, the following properties hold: (a) $\alpha_i$ and $\alpha_j$
have a non-void common suffix; (b) if $\mathcal{P}_i$ is a descendant of $\mathcal{P}_j$ in $\mathcal{A}$ then $\alpha_j$ is a proper suffix of $\alpha_i$.

Columns. The agglomerates induce columns: a column $C_i$ of $\mathcal{A}$ is the pair $(c_i, r_i)$ such that
$T_{c_i} \sqsubseteq^+ T_{r_i}$, where $T_{c_i}$ is a contact suffix (i.e. it belongs to one of $\mathcal{A}$'s contact nodes) and $T_{r_i}$ is a root
suffix of $\mathcal{A}$ (i.e. it belongs to $\mathcal{A}$'s root). Each column $T[c_i, r_i]$ is a contiguous substring of $T$ and, at
any time, all the columns of all the agglomerates form a non-overlapping partition of $T$. A column
$(c_i, r_i)$ is associated with the contact subproblem $\mathcal{P}_j$ such that $T_{c_i} \in \mathcal{P}_j$. We will denote by $\|\mathcal{P}_j\|$
the number of columns associated with a contact subproblem $\mathcal{P}_j$. In Fig. 2, each agglomerate has
its columns depicted beside its tree. The columns are shown as contiguous substrings of $T$ (and $T$
as a non-overlapping partitioning of the columns). For any agglomerate $\mathcal{A}$, the number of its root
suffixes, the number of its contact suffixes, the number of its columns and, if $\mathcal{A}$ is unsolved, the
number of suffixes in its leading subproblem are all the same: we denote this quantity by $\|\mathcal{A}\|$.

Data structures. The algorithmic challenge behind agglomerates is that the total cost of refining
each individual subproblem could be prohibitive: there is a hidden rescanning cost that we must
avoid. As we will describe later, the refining algorithm will pick an unsolved agglomerate $\mathcal{A}$, and
process it by refining simultaneously all of its subproblems according to the ranks of its leading
subproblem $\mathcal{P}_w$. This is a non-trivial task. For example, to achieve optimality (Section 6) the
refinement of $\mathcal{A}$ must be executed in time proportional to the number $|\mathcal{P}_w|$ (i.e. $\|\mathcal{A}\|$) of suffixes in
$\mathcal{P}_w$ plus the number of newly created subproblems, whereas $\mathcal{A}$ can contain many more subproblems
and suffixes.

The main structure for an agglomerate $\mathcal{A}$ is its Hasse diagram according to $\sqsubseteq^+$, which is a tree
with $|\mathcal{A}|$ nodes, each one representing a subproblem of $\mathcal{A}$. Double links between children and parent
are maintained. If $\mathcal{A}$ is unsolved, we also maintain (a) a pointer to its leading subproblem $\mathcal{P}_w$ and
(b) the number of nodes in the path between $\mathcal{A}$'s root and $\mathcal{P}_w$. Only $2\|\mathcal{A}\|$ suffixes are stored for
$\mathcal{A}$, i.e. its $\|\mathcal{A}\|$ columns. They are divided into lists, one for each contact subproblem of $\mathcal{A}$: the one
for $\mathcal{P}_j$ contains all the $\|\mathcal{P}_j\|$ columns $(c_i, r_i)$ such that $T_{c_i} \in \mathcal{P}_j$ (see Fig. 2).

We call skip node one that is a branching node (i.e. one with two or more children) or a contact
node or the root of $\mathcal{A}$ (the conditions are not mutually exclusive). For each skip node $\mathcal{P}_j$ we
maintain a skip link which is a double link between $\mathcal{P}_j$ and its lowest ancestor that is also a skip
node. Clearly, the graph induced by the skip links is also a tree. We refer to it as $\mathcal{A}$'s skip tree.
For each internal skip node $\mathcal{P}_j$ and each child $\mathcal{P}_{j_t}$ of $\mathcal{P}_j$ in the skip tree, we maintain a guide link
that goes from $\mathcal{P}_j$ to the highest node of $\mathcal{A}$ that is (i) unsolved and (ii) both a descendant of $\mathcal{P}_j$
and an ancestor of $\mathcal{P}_{j_t}$ (if any such node exists). The skip tree of $\mathcal{A}$ with its guide links requires
$O(\|\mathcal{A}\|)$ space. In Fig. 2, skip and guide links are depicted as dashed arcs and dotted arrows.
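
The definition of skip nodes and skip links can be sketched as follows. This is a minimal illustration, not the paper's maintenance procedure: it assumes the agglomerate's tree is given as a hypothetical `parent` dictionary and its contact nodes as a set, and it finds each skip link by climbing towards the root, which is simple but not linear in the worst case.

```python
def build_skip_links(parent, contacts):
    """Return {skip node: lowest proper ancestor that is also a skip node}.

    parent   -- hypothetical dict: node -> parent (the root maps to None)
    contacts -- set of contact nodes
    """
    child_count = {v: 0 for v in parent}
    for v, p in parent.items():
        if p is not None:
            child_count[p] += 1

    def is_skip(v):
        # root, branching node, or contact node (conditions may overlap)
        return parent[v] is None or child_count[v] >= 2 or v in contacts

    links = {}
    for v in parent:
        if parent[v] is None or not is_skip(v):
            continue
        u = parent[v]
        while not is_skip(u):   # stops at the root at the latest
            u = parent[u]
        links[v] = u
    return links


# Path 0 - 1 - 2, then node 2 branches into 3 (with child 5) and 4.
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 2, 5: 3}
print(build_skip_links(parent, contacts={4, 5}))
```

Nodes 1 and 3 are neither branching nor contact nodes, so they are bypassed by the skip links.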

Besides the pointers needed by the linked structures containing it, each subproblem $\mathcal{P}_i$ carries
$O(1)$ words of information: (a) its integer label $\ell_i$, (b) $|\mathcal{P}_i|$ and, when $\mathcal{P}_i$ is a contact node, the
following additional information: (c) $\|\mathcal{P}_i\|$, (d) a pointer to the list of its columns and (e) a pointer
to the root of $\mathcal{A}$. For a generic subproblem $\mathcal{P}_i$, we do not explicitly maintain (e.g. in a list associated
with $\mathcal{P}_i$) either its suffixes or its ranks in $\mathcal{R}(\mathcal{P}_i)$. The space required to store an agglomerate $\mathcal{A}$ is
just $O(|\mathcal{A}| + \|\mathcal{A}\|)$ memory words.

We also maintain some global bookkeeping structures that are shared among the agglomerates.
(a) For each contact or root suffix $T_i$, a pointer Suff[i] to the subproblem to which it belongs. (b) A
sorted linked list SubList of all the subproblems. (c) An array Ranks to find the nearest rank in
$\mathcal{R}$ in constant time, so that the set of ranks $\mathcal{R}(\mathcal{P}_i)$ of any unsolved subproblem $\mathcal{P}_i$ can be retrieved
in $O(|\mathcal{R}(\mathcal{P}_i)|)$ time.
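
One way such a Ranks array could work is sketched below. The layout is an assumption, since the text does not spell it out: `nxt[p]` stores the index of the smallest requested rank that is at least p, and a subproblem with label l and s suffixes is taken to cover the ranks l+1, ..., l+s. The function names are hypothetical.

```python
def build_ranks_index(ranks, n):
    """nxt[p] = index in `ranks` of the smallest rank >= p, for 1 <= p <= n + 1.

    ranks -- sorted list of distinct requested ranks in 1..n
    """
    nxt = [len(ranks)] * (n + 2)
    k = len(ranks)
    for p in range(n, 0, -1):
        if k > 0 and ranks[k - 1] == p:
            k -= 1
        nxt[p] = k
    return nxt


def ranks_of(ranks, nxt, label, size):
    """Requested ranks inside a subproblem with the given label and size.

    Assumes the subproblem covers ranks label + 1 .. label + size.
    Runs in time proportional to the number of ranks returned (plus O(1)).
    """
    out = []
    i = nxt[label + 1]          # nearest rank, found in constant time
    while i < len(ranks) and ranks[i] <= label + size:
        out.append(ranks[i])
        i += 1
    return out


ranks = [2, 5, 6]
nxt = build_ranks_index(ranks, n=8)
print(ranks_of(ranks, nxt, label=2, size=3))   # ranks among 3, 4, 5
```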



3.3 Basic refining operations: slicing and joining agglomerates 

Before describing the general refining scheme in Section 4, we discuss some useful operations that
operate on agglomerates. The Slice operation takes in input an agglomerate $\mathcal{A}$ and assumes that
each column $(c_i, r_i)$ of $\mathcal{A}$ has been tagged with an integer in $\{1, \ldots, d\}$, for some $d \le \|\mathcal{A}\|$. For any
suffix $T_j$ of a subproblem in $\mathcal{A}$, let us denote by $\tau(T_j)$ the tag of the column $(c_i, r_i)$ to which it
belongs ($c_i \le j \le r_i$). Slice($\mathcal{A}$) obtains the following without changing the columns of $\mathcal{A}$:

(a) Each subproblem $\mathcal{P}_x \in \mathcal{A}$ is partitioned (this does not necessarily give a refinement) into new
subproblems $\mathcal{P}_{x_1}, \ldots, \mathcal{P}_{x_{d'}}$, where $d' \le d$ and all the suffixes $T_i \in \mathcal{P}_{x_j}$ have the same tag
$\tau(T_i)$, for each $1 \le j \le d'$ (after the partitioning $\mathcal{P}_x$ ceases to exist). If all the suffixes of some
$\mathcal{P}_x \in \mathcal{A}$ already have the same tag, $\mathcal{P}_x$ remains unmodified.

(b) All the subproblems in $\mathcal{A}$, both the new ones and the unmodified ones, are distributed into new
agglomerates $\mathcal{A}_1, \ldots, \mathcal{A}_d$ such that, for each $1 \le t \le d$ and for each suffix $T_j \in \mathcal{P}_{x_y} \in \mathcal{A}_t$, we
have that $\tau(T_j) = t$. Hence $\sum_{i=1}^{d} \|\mathcal{A}_i\| = \|\mathcal{A}\|$, with $|\bigcup_{i=1}^{d} \mathcal{A}_i| - |\mathcal{A}|$ newly created subproblems.



Lemma 1 Slicing $\mathcal{A}$ into $\mathcal{A}_1, \ldots, \mathcal{A}_d$ takes $O(\|\mathcal{A}\| + |\bigcup_{i=1}^{d} \mathcal{A}_i| - |\mathcal{A}|)$ time.



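The join operation Join($\mathcal{A}'$, $\mathcal{A}$) is much simpler. An agglomerate $\mathcal{A}'$ is joinable with $\mathcal{A}$ if there
exists a contact subproblem $\mathcal{P}_x \in \mathcal{A}$ such that, for each suffix $T_j$ of the root subproblem $\mathcal{P}_{r'}$ of $\mathcal{A}'$,
it is $T_{j+1} \in \mathcal{P}_x$. Hence $\mathcal{A}' \cup \mathcal{A}$ is an agglomerate and, by joining with $\mathcal{A}$, we have that $\mathcal{A}'$ disappears
and only $\mathcal{A}$ remains (their trees are fused with $\mathcal{P}_{r'}$ child of $\mathcal{P}_x$), thus some columns are modified.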

Lemma 2 If $\mathcal{A}'$ is joinable with $\mathcal{A}$ then the join operation requires $O(\|\mathcal{A}'\|)$ time.

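During the computation we may need to combine the above two operations. Let agglomerates
$\mathcal{A}$ and $\mathcal{A}^*$ be unsolved and exhausted, respectively. Also let us assume that $\mathcal{A}$ is joinable with $\mathcal{A}^*$
and let $\mathcal{P}_r$ be $\mathcal{A}$'s root. The SliceJoin($\mathcal{A}$, $\mathcal{A}^*$) operation does the following: (i) it slices $\mathcal{A}^*$ into
$\mathcal{A}^*_1$ and $\mathcal{A}^*_2$, where $(c_i, r_i)$ is a column of $\mathcal{A}^*_1$ iff $T_{c_i - 1} \in \mathcal{P}_r \in \mathcal{A}$; (ii) it joins $\mathcal{A}$ with $\mathcal{A}^*_1$.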

Lemma 3 SliceJoin($\mathcal{A}$, $\mathcal{A}^*$) requires $O(\|\mathcal{A}\|)$ time.



4 Optimal Algorithm for Multi-Selection of Suffixes

Initialization stage. We begin by initializing the bookkeeping data structures described in
Section 3.2. Then, let MselMset be an optimal multi-selection algorithm for a multiset of items, thus
generalizing the result in [7] to multisets (see Section 5.1). We call MselMset($\{T[1], \ldots, T[N]\}$, K)
using the alphabet order: this refines the suffixes in $S$ according to their first symbols into the
pivotal multisets $\mathcal{M}_0, \mathcal{F}_1, \mathcal{M}_1, \ldots, \mathcal{F}_t, \mathcal{M}_t$. Each $\mathcal{M}_j \neq \emptyset$ forms a degenerate subproblem $\mathcal{P}_i$ since no
ranks fall within it, whereas each $\mathcal{F}_j$ forms either a solved or unsolved subproblem $\mathcal{P}_j$. The
agglomerates are initialized as singletons: each $\mathcal{P}_x$ forms an agglomerate $\mathcal{A}_x$ that is either exhausted (i.e.
$\mathcal{P}_x$ is either solved or exhausted) or unsolved (i.e. it satisfies Property 1). Each $\mathcal{A}_x$ is moved into
one of the groups unsolved_g and exhausted_g, which will be used in the rest of the computation.

Refinement stage. We proceed with the refine and aggregate stage. We execute a loop until the
group unsolved_g is empty. In each iteration of the loop we pick an agglomerate from unsolved_g
and we apply to it the Refine and Aggregate Process, shortly RAP, described in Section 4.1. During
each RAP, some temporary agglomerates will be created and they may not fulfill the characterization
given in Section 3.2. Thus, any agglomerate created during a RAP is to be considered as temporary
until it disappears (since it is joined with another) or it is moved into either one of the global groups
unsolved_g and exhausted_g, at the end of RAP.

Finalization stage. Once unsolved_g is empty, we run a finalization stage that returns the $K$
wanted suffixes $T_{i_1} < T_{i_2} < \cdots < T_{i_K}$ by suitably scanning SubList.

4.1 The Refine and Aggregate Process (RAP) 

First step: establishing which kind of agglomerate. The first kind is when $\mathcal{A}$ is core cyclic:
there is one contact subproblem $\mathcal{P}_{x^*} \in \mathcal{A}$, called the core of $\mathcal{A}$, such that (a) there exists at least one
column $(c_i, r_i)$ of $\mathcal{A}$ with $T_{r_i+1} \in \mathcal{P}_{x^*}$, and (b) for each column $(c_i, r_i)$ of $\mathcal{A}$, either $T_{r_i+1} \in \mathcal{P}_{x^*}$ or
$T_{r_i+1} \in \mathcal{P}_{y_i} \notin \mathcal{A}$. For example, consider Fig. 2, where $\mathcal{A} = \mathcal{A}_2$ is core cyclic with core $\mathcal{P}_{x^*} = \mathcal{P}_{19}$: the
only two columns $(c_i, r_i)$ of $\mathcal{A}_2$ with $T_{r_i+1} \in \mathcal{P}_{19}$ are $C_{21} = (132, 133)$ and $C_{22} = (134, 135)$, while the
others have $T_{r_i+1} \in \mathcal{P}_{y_i} \notin \mathcal{A}_2$. The columns in the core $\mathcal{P}_{x^*}$ are all equal and consecutive inside $T$.
The second kind is when $\mathcal{A}$ is generic (i.e. not core cyclic), as illustrated by the example in Fig. 2:
agglomerate $\mathcal{A}_3$ is generic and "acyclic", as each column $(c_i, r_i)$ of $\mathcal{A}_3$ has $T_{r_i+1} \in \mathcal{P}_{y_i} \notin \mathcal{A}_3$; instead,
agglomerate $\mathcal{A}_1$ is generic and "cyclic", as its columns $(53, 60)$ and $(118, 120)$ have $T_{61}$ and $T_{121}$
in two distinct subproblems of $\mathcal{A}_1$. More formally, a generic and "acyclic" agglomerate $\mathcal{A}$ has each of its columns
$(c_i, r_i)$ with $T_{r_i+1} \in \mathcal{P}_{y_i} \notin \mathcal{A}$; a generic and "cyclic" agglomerate $\mathcal{A}$ is one where there exist columns
$(c_i, r_i)$ and $(c_{i'}, r_{i'})$ such that $T_{r_i+1} \in \mathcal{P}_x \in \mathcal{A}$ and $T_{r_{i'}+1} \in \mathcal{P}_{x'} \in \mathcal{A}$ but $\mathcal{P}_x \neq \mathcal{P}_{x'}$. Note that
cyclic agglomerates of the second kind are treated differently from those of the first kind.

Second step: building the keys for the agglomerate. We assign a key to each column of
agglomerate $\mathcal{A}$. If $\mathcal{A}$ is core cyclic (i.e. first kind), the key for each column $(c, r)$ is as follows. Let
$(c_1, r_1), (c_2, r_2), \ldots, (c_f, r_f)$ be the maximal sequence of (equal) columns of $\mathcal{A}$ such that $r + 1 =
c_1, r_1 + 1 = c_2, \ldots, r_{f-1} + 1 = c_f$. The key for $(c, r)$ is the sequence $T_{c_1} T_{c_2} \cdots T_{c_f} T_{r_f+1}$. If $\mathcal{A}$ is
generic (i.e. second kind, acyclic or not), each column $(c, r)$ simply gets $T_{r+1}$ as key. What about
individual suffixes? The key of each $T_j$ is implicitly the same as that assigned to its column $(c_i, r_i)$,
where $c_i \le j \le r_i$. However, unlike the columns of $\mathcal{A}$, we do not access each suffix of $\mathcal{A}$ to actually
assign it its key, which is retrieved on the fly only when needed. Recall that at any time the total
order $<$ and the equivalence relation $\sim$ are defined on the suffixes (Section 3.1). Hence, any two
keys can be compared in $O(1)$, since these suffixes are either contact or root, so we can use Suff to
retrieve their labels: if $\mathcal{A}$ is core cyclic, its key becomes $\ell_i \ell_i \cdots \ell_i \ell_j$ for labels $\ell_i$ (repeated $f$ times
for $T_{c_1}, T_{c_2}, \ldots, T_{c_f}$) and $\ell_j$ (for $T_{r_f+1}$); if $\mathcal{A}$ is generic, its key becomes a single label (for $T_{r+1}$).

Third step: multi-selection on the leading subproblem seen as a multiset. We retrieve
the ranks $r'_1, \ldots, r'_{|\mathcal{R}(\mathcal{P}_w)|}$ of the leading subproblem $\mathcal{P}_w$ of $\mathcal{A}$, using Ranks (Section 3.2). We also
retrieve all the suffixes of $\mathcal{P}_w$ (not explicitly stored for $\mathcal{P}_w$) and their keys computed in the second
step, in $O(|\mathcal{P}_w|) = O(\|\mathcal{A}\|)$ time. Then, we call MselMset($\mathcal{P}_w$, $\{\rho_1, \ldots, \rho_{|\mathcal{R}(\mathcal{P}_w)|}\}$), where the ranks
are $\rho_j = r'_j - \ell_w$ for the label $\ell_w$ of $\mathcal{P}_w$, and the order among keys is the one given by $<$ and $\sim$. Let
the resulting pivotal multisets be $\mathcal{M}_0, \mathcal{F}_1, \mathcal{M}_1, \ldots, \mathcal{F}_t, \mathcal{M}_t$, where $\mathcal{F}_j$ contains the suffixes that are
candidates for some ranks and $t \le |\mathcal{R}(\mathcal{P}_w)|$. We assign tags to the columns in $\mathcal{A}$ that are consistent
with the order among these pivotal multisets, so that suffixes in the same multiset receive the same
tag. (This is useful for the refinement in the next step.) For each $0 \le j \le t$, let $h_j$ be the number
of $\mathcal{M}_x \neq \emptyset$ with $0 \le x \le j$ (some of the $\mathcal{M}_j$'s may be empty). We use them to tag each column $C$
of $\mathcal{A}$: the tag is $h_0$ if the suffix of $\mathcal{P}_w$ that corresponds to $C$ belongs to $\mathcal{M}_0$; the tag is $j + h_{j-1}$ if
that suffix belongs to $\mathcal{F}_j$, or $j + h_j$ if that suffix belongs to $\mathcal{M}_j$, for some $1 \le j \le t$.
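
The tagging rule of the third step can be illustrated with a short sketch. It only computes the tag attached to each pivotal multiset, given the sizes of the multisets M_0, ..., M_t; the function name and the dictionary-based output are assumptions made for the example.

```python
def pivotal_tags(m_sizes):
    """Tags of the pivotal multisets, following the rule with the counters h_j.

    m_sizes -- list with m_sizes[j] = |M_j| for j = 0..t
    Returns (tag_m, tag_f): tag_m[j] is the tag of a non-empty M_j and
    tag_f[j] is the tag of F_j (F_j is never empty).
    """
    t = len(m_sizes) - 1
    h, count = [], 0
    for size in m_sizes:            # h[j] = number of non-empty M_x with x <= j
        count += 1 if size > 0 else 0
        h.append(count)
    tag_f = {j: j + h[j - 1] for j in range(1, t + 1)}
    # for j = 0 the formula j + h[j] reduces to h[0], as in the text
    tag_m = {j: j + h[j] for j in range(t + 1) if m_sizes[j] > 0}
    return tag_m, tag_f


print(pivotal_tags([2, 0, 5]))   # M_0 and M_2 non-empty, M_1 empty
```

With this rule the tags in use are consecutive integers starting from 1 and follow the order of the pivotal multisets, so empty multisets do not leave gaps.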
Fourth step: slicing the agglomerate. We perform the slicing of $\mathcal{A}$ (Section 3.3) using the $d$
tags computed in the third step. We call Slice($\mathcal{A}$) and obtain agglomerates $\mathcal{A}_1, \ldots, \mathcal{A}_d$, where the
columns in each $\mathcal{A}_i$ have the same tag and, for $i \neq j$, the tag of $\mathcal{A}_i$ is different from that of $\mathcal{A}_j$. Then
$\mathcal{A}_1, \ldots, \mathcal{A}_d$ are moved into four groups: the two global ones previously mentioned, exhausted_g and
unsolved_g, and two local ones for temporary agglomerates, called undecided_g and joinable_g,
according to the following rule for the given $\mathcal{A}_i$, where $1 \le i \le d$:

• if $\mathcal{A}_i$'s tag corresponds to a multiset $\mathcal{M}_j$ (see the third step), move it to undecided_g;

• else $\mathcal{A}_i$'s tag corresponds to a multiset $\mathcal{F}_j$:

- if $\|\mathcal{A}_i\| = 1$, move $\mathcal{A}_i$ to exhausted_g;

- else if $\mathcal{A}$ was generic and there is a column $(c_j, r_j)$ of $\mathcal{A}_i$ with $T_{r_j+1} \in \mathcal{P}_x \in \mathcal{A}$,
move $\mathcal{A}_i$ to unsolved_g (since $\mathcal{A}_i$ satisfies Property 1);

- else move $\mathcal{A}_i$ to joinable_g.

This step has a subtle point, since it implicitly induces the refinement of many subproblems
simultaneously without paying the rescanning cost. Not only is the leading subproblem $\mathcal{P}_w$ refined
into its pivotal multisets using its ranks in $\mathcal{R}(\mathcal{P}_w)$, but each active subproblem $\mathcal{P}_i$ of $\mathcal{A}$ is refined
as well. However, the refinement of $\mathcal{P}_i \neq \mathcal{P}_w$ could be coarser than what would be obtained by refining $\mathcal{P}_i$
directly using its ranks in $\mathcal{R}(\mathcal{P}_i)$: indeed, as pointed out in Section 3.2, the $\alpha$-string $\alpha_w$ is a proper
suffix of $\alpha_i$, the $\alpha$-string of $\mathcal{P}_i$ (descendant of $\mathcal{P}_w$). But $\mathcal{P}_i$'s induced refinement is for free since the
cost is charged to $\mathcal{P}_w$, and any subsequent refinement for $\mathcal{P}_i$ will surely create new subproblems,
for which we can pay (see what is claimed for data structures in Section 3.2 and Lemma 1).
Fifth step: processing undecided_g. The group undecided_g collects the temporary agglomerates
$\mathcal{A}_i$ for which there are no ranks from $\mathcal{R}(\mathcal{P}_w)$ falling within $\mathcal{A}_i$. However, there could be other
ranks in $\mathcal{R} - \mathcal{R}(\mathcal{P}_w)$ that could involve $\mathcal{A}_i$. Hence, (i) $\mathcal{A}_i$ may or may not contain unsolved
subproblems, and (ii) even if $\mathcal{A}_i$ contains unsolved subproblems, it may not satisfy Property 1. (Here
we may create a neighborhood from a subproblem of $\mathcal{A}_i$ as discussed in Section 3.1.) For each $\mathcal{A}_i$
in undecided_g, we retrieve the topmost unsolved subproblems $\mathcal{P}_{i_1} < \mathcal{P}_{i_2} < \cdots < \mathcal{P}_{i_{d_i - 1}}$ from $\mathcal{A}_i$
in preorder, in $O(\|\mathcal{A}_i\|)$ time using its skip tree (Section 3.2). Since no ancestor of $\mathcal{P}_{i_t}$ is unsolved,
we assign tag $t$ to its columns, $1 \le t \le d_i - 1$. We assign tag $d_i$ to the remaining columns of $\mathcal{A}_i$,
which are not associated with any $\mathcal{P}_{i_t}$. We call Slice($\mathcal{A}_i$) with these $d_i$ tags: for $1 \le t \le d_i - 1$, we
create a new agglomerate $\mathcal{A}_{i_t}$ with leading subproblem $\mathcal{P}_{i_t}$ and put it into unsolved_g. We create
a new agglomerate $\mathcal{A}_{i_{d_i}}$, exhausted by construction, and put it into exhausted_g.

Sixth step: processing joinable_g. This step represents the "aggregate" part of RAP. For each
agglomerate $\mathcal{A}_i$ in joinable_g, let $\mathcal{A}_{i^*}$ be the agglomerate with which $\mathcal{A}_i$ is joinable (Section 3.3).
Note that $\mathcal{A}_{i^*}$ is either in unsolved_g or exhausted_g (but not in joinable_g, see the fourth step).

If $\mathcal{A}_{i^*}$ is in unsolved_g, or if $\|\mathcal{A}_{i^*}\| = \|\mathcal{A}_i\|$, we call Join($\mathcal{A}_i$, $\mathcal{A}_{i^*}$) and move the resulting
agglomerate into unsolved_g. In this way, we maintain Property 1: if $\mathcal{A}_{i^*}$ is unsolved, then it satisfies
the property; if $\mathcal{A}_{i^*}$ is exhausted, the leading subproblem of $\mathcal{A}_i$ is also viable for the agglomerate
obtained from Join($\mathcal{A}_i$, $\mathcal{A}_{i^*}$) (since $\|\mathcal{A}_{i^*}\| = \|\mathcal{A}_i\|$).

If $\mathcal{A}_{i^*}$ is in exhausted_g and $\|\mathcal{A}_{i^*}\| > \|\mathcal{A}_i\|$, we call SliceJoin($\mathcal{A}_i$, $\mathcal{A}_{i^*}$) producing $\mathcal{A}_{i^*1}$ and $\mathcal{A}_{i^*2}$.
We move $\mathcal{A}_{i^*1}$ to unsolved_g as it satisfies Property 1. We move $\mathcal{A}_{i^*2}$ to exhausted_g: since $\mathcal{A}_{i^*}$
was exhausted, it did not have a leading subproblem and $\mathcal{A}_i$'s leading subproblem is not viable for
the resulting $\mathcal{A}_{i^*2}$ (since $\|\mathcal{A}_{i^*}\| > \|\mathcal{A}_i\|$, at least one of the conditions of Property 1 is violated).

5 Details of the Optimal Algorithm 

5.1 Multi-selection on multisets 

Consider the multi-selection problem on multisets of elements that are comparable in $O(1)$ time.
The multi-selection algorithm in [7] does not exploit the presence of equal elements. Let us describe
our variant. Given a multiset $\mathcal{M}$ and $K$ ranks $r_1 < \cdots < r_K$, we want to partition $\mathcal{M}$ into its
pivotal multisets $\mathcal{M}_0, \mathcal{F}_1, \mathcal{M}_1, \ldots, \mathcal{F}_t, \mathcal{M}_t$ such that the following holds:

(i) For any $1 \le i < j \le t$, for any $p_i \in \mathcal{F}_i, e_i \in \mathcal{M}_i, p_j \in \mathcal{F}_j, e_j \in \mathcal{M}_j$, we have that $p_i < e_i <
p_j < e_j$; moreover, for any $e_0 \in \mathcal{M}_0, p_1 \in \mathcal{F}_1$, we have that $e_0 < p_1$.

(ii) The elements in $\mathcal{F}_i$ are equal and $|\mathcal{F}_i|$ is the multiplicity of $p_i \in \mathcal{F}_i$.

(iii) For each rank $r_j$ there exists an $\mathcal{F}_i$ such that each $p_i$ in $\mathcal{F}_i$ has rank $r_j$.

(iv) For each $\mathcal{F}_i$ there exists a rank $r_j$ such that each $p_i$ in $\mathcal{F}_i$ has rank $r_j$.

Notice that $t \le K$ and $|\mathcal{F}_i| \ge 1$, whereas there may be some $\mathcal{M}_i = \emptyset$. Let us now describe our
algorithm.

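MselMset($\mathcal{M}$, $r_1, \ldots, r_K$)

1. If $K = 0$ exit. If $K = 1$ select and output the element of rank $r_1$.

2. Find the largest $r_i \le \lceil |\mathcal{M}|/2 \rceil$. Then select the element $p'$ ($p''$) with rank $r_i$ ($r_{i+1}$) and all the
elements of $\mathcal{M}$ that are equal to $p'$ ($p''$).

3. Partition $\mathcal{M}$ into $\mathcal{M}', \mathcal{F}', \mathcal{M}'', \mathcal{F}'', \mathcal{M}'''$ such that (a) $\mathcal{F}'$ ($\mathcal{F}''$) contains all the elements in $\mathcal{M}$
equal to $p'$ ($p''$) and (b) for any $e' \in \mathcal{M}', e'' \in \mathcal{M}'', e''' \in \mathcal{M}'''$ we have that $e' < p' < e'' <
p'' < e'''$.

4. Find the largest $r_a \le |\mathcal{M}'|$. Call MselMset($\mathcal{M}'$, $r_1, \ldots, r_a$).

5. Let $q = |\mathcal{M}'| + |\mathcal{F}'| + |\mathcal{M}''|$. Find the smallest $r_b$ s.t. $r_b > q$.
Call MselMset($\mathcal{F}'' \cup \mathcal{M}'''$, $r_b - q, \ldots, r_K - q$).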
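
The steps above can be sketched in Python as follows. This is an illustrative variant under stated assumptions, not the paper's implementation: selection is done on a sorted copy for brevity (a linear-time selection would be needed for the stated bounds), the second recursive call is made on M''' alone because F'' is emitted directly, and the cases in which only one of p' and p'' exists, or in which they coincide, are handled with a single pivot. The function name and the list-of-lists output are choices made for the example.

```python
def msel_mset(items, ranks):
    """Return the pivotal multisets [M_0, F_1, M_1, ..., F_t, M_t] of `items`.

    items -- list of mutually comparable elements (a multiset)
    ranks -- sorted list of distinct 1-based ranks, each in 1..len(items)
    """
    if not ranks:
        return [list(items)]                     # no rank left: a single M
    half = (len(items) + 1) // 2                 # ceil(|M| / 2)
    ordered = sorted(items)                      # stand-in for linear-time selection
    low = [r for r in ranks if r <= half]
    high = [r for r in ranks if r > half]
    pivots = []
    if low:
        pivots.append(ordered[low[-1] - 1])      # p': largest rank <= ceil(|M|/2)
    if high:
        p2 = ordered[high[0] - 1]                # p'': the next requested rank
        if not pivots or p2 != pivots[-1]:
            pivots.append(p2)
    p_lo, p_hi = pivots[0], pivots[-1]
    below = [x for x in items if x < p_lo]       # M'
    above = [x for x in items if x > p_hi]       # M'''
    middle = [[x for x in items if x == p_lo]]   # F'
    if p_hi != p_lo:
        middle.append([x for x in items if p_lo < x < p_hi])   # M''
        middle.append([x for x in items if x == p_hi])         # F''
    q = len(items) - len(above)                  # number of elements <= p''
    left = msel_mset(below, [r for r in ranks if r <= len(below)])
    right = msel_mset(above, [r - q for r in ranks if r > q])
    return left + middle + right


print(msel_mset([3, 1, 2, 3, 3, 5, 4, 1], [2, 5]))
```

In the printed result the entries at odd positions are the multisets F_j (all copies of a selected element) and those at even positions are the multisets M_j, some of which may be empty.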

The complexity of MselMset can be expressed in terms of the sizes of the pivotal multisets of
$\mathcal{M}$ w.r.t. $r_1, \ldots, r_K$. For the sake of description, we will assume that $\log 0 = 0$.

Lemma 4 The running time of the algorithm MselMset on a multiset $\mathcal{M}$ is upper bounded by
$c|\mathcal{M}| \log|\mathcal{M}| - c|\mathcal{M}_0| \log|\mathcal{M}_0| - c \sum_{i=1}^{t} (|\mathcal{F}_i| + |\mathcal{M}_i|) \log(|\mathcal{F}_i| + |\mathcal{M}_i|) + c|\mathcal{M}|$ for a suitable integer
constant $c$.



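Proof: From step 3 and because of the choice of $p'$ and $p''$, it is clear that $\mathcal{F}'$, $\mathcal{M}''$ and $\mathcal{F}''$
are three of the pivotal multisets. Let $\mathcal{F}'$ and $\mathcal{M}''$ be $\mathcal{F}_s$ and $\mathcal{M}_s$, respectively (thus $\mathcal{F}''$ is
$\mathcal{F}_{s+1}$). Also, the algorithm never breaks any pivotal multiset and the partitioning of the
input elements for the recursive calls is also a partitioning of the (yet unknown) pivotal
multisets. All the non-recursive steps of the algorithm require $\le c|\mathcal{M}|$ time, for a suitable
integer constant $c$. Therefore we have that the running time is upper bounded by the function
$g(|\mathcal{M}|) = c|\mathcal{M}| + g(|\mathcal{M}'|) + g(|\mathcal{F}'' \cup \mathcal{M}'''|)$. To prove the upper bound for $g$ we use the
function $f(|\mathcal{M}|) = c|\mathcal{M}| + c(|\mathcal{F}_s| + |\mathcal{M}_s|) \log(|\mathcal{F}_s| + |\mathcal{M}_s|) + f(|\mathcal{M}'|) + f(|\mathcal{F}'' \cup \mathcal{M}'''|)$. Let us show that
$f(|\mathcal{M}|) = g(|\mathcal{M}|) + c \sum_{i=1}^{t} (|\mathcal{F}_i| + |\mathcal{M}_i|) \log(|\mathcal{F}_i| + |\mathcal{M}_i|) + c|\mathcal{M}_0| \log|\mathcal{M}_0|$. Since the algorithm does
not break $\mathcal{M}$'s pivotal multisets, we know that the pivotal multisets of $\mathcal{M}'$ and $\mathcal{F}'' \cup \mathcal{M}'''$ are
$\mathcal{M}_0, \mathcal{F}_1, \mathcal{M}_1, \ldots, \mathcal{F}_{s-1}, \mathcal{M}_{s-1}$ and $\mathcal{F}_{s+1}, \mathcal{M}_{s+1}, \ldots, \mathcal{F}_t, \mathcal{M}_t$, respectively. Hence, by induction, we
have that

$$f(|\mathcal{M}|) = c|\mathcal{M}| + c(|\mathcal{F}_s| + |\mathcal{M}_s|) \log(|\mathcal{F}_s| + |\mathcal{M}_s|)$$
$$\quad + g(|\mathcal{M}'|) + c \sum_{i=1}^{s-1} (|\mathcal{F}_i| + |\mathcal{M}_i|) \log(|\mathcal{F}_i| + |\mathcal{M}_i|) + c|\mathcal{M}_0| \log|\mathcal{M}_0|$$
$$\quad + g(|\mathcal{F}'' \cup \mathcal{M}'''|) + c \sum_{i=s+1}^{t} (|\mathcal{F}_i| + |\mathcal{M}_i|) \log(|\mathcal{F}_i| + |\mathcal{M}_i|).$$

Thus, by the definition of $g$, the relation between $f$ and $g$ is proven. All we have to do now
is to prove that $f(|\mathcal{M}|) \le c|\mathcal{M}| \log|\mathcal{M}| + c|\mathcal{M}|$, and the wanted upper bound for $g$ will follow
by subtraction. By the definition of $\mathcal{F}'$ we know that both $|\mathcal{M}'|$ and $|\mathcal{F}'' \cup \mathcal{M}'''|$ are less than
$|\mathcal{M}|/2$. Thus, by induction we have the following:

$$f(|\mathcal{M}|) \le c|\mathcal{M}| + c(|\mathcal{F}_s| + |\mathcal{M}_s|) \log(|\mathcal{F}_s| + |\mathcal{M}_s|) + c|\mathcal{M}'| + c|\mathcal{M}'| \log|\mathcal{M}'| + c|\mathcal{F}'' \cup \mathcal{M}'''| + c|\mathcal{F}'' \cup \mathcal{M}'''| \log|\mathcal{F}'' \cup \mathcal{M}'''|$$
$$\le c|\mathcal{M}| + c(|\mathcal{F}_s| + |\mathcal{M}_s|) \log|\mathcal{M}| + c|\mathcal{M}'| + c|\mathcal{M}'| \log\frac{|\mathcal{M}|}{2} + c|\mathcal{F}'' \cup \mathcal{M}'''| + c|\mathcal{F}'' \cup \mathcal{M}'''| \log\frac{|\mathcal{M}|}{2}$$
$$= c|\mathcal{M}| \log|\mathcal{M}| + c|\mathcal{M}|. \qquad \blacksquare$$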

5.2 Dealing with the agglomerates in undecided_g 

First: For each agglomerate $\mathcal{A}_i$ in undecided_g, we find its highest (i.e. closest to the tree root of
$\mathcal{A}_i$) unsolved subproblems (if any) and we collect them in Lead$_i$. Since visiting the whole tree of $\mathcal{A}_i$
would cost too much ($O(|\mathcal{A}_i|)$ time), we use $\mathcal{A}_i$'s skip tree and its guide links. In this way the visit
takes $O(\|\mathcal{A}_i\|)$ time. Specifically, we use two other lists Con$_i$ and F$_i$ (initially empty) besides Lead$_i$,
starting from the root with procedure LeadVisit($\mathcal{P}_r$) defined as follows for a generic subproblem
$\mathcal{P}_r$:

1. If $\mathcal{P}_r$ is unsolved, we append $\mathcal{P}_r$, 1 and $|\mathcal{P}_r|$ at the end of Lead$_i$, F$_i$ and Con$_i$, respectively, and
return.

2. Otherwise, if $\mathcal{P}_r$ is a contact node, we append 0 and $\|\mathcal{P}_r\|$ to F$_i$ and Con$_i$, respectively (but
we do not return yet).

3. For each child $\mathcal{P}_{r_j}$ of $\mathcal{P}_r$ (in $\mathcal{A}_i$'s skip tree), from the leftmost to the rightmost one, we do the
following. If $\mathcal{P}_r$ has a guide link to an ancestor $\mathcal{P}_{r_x}$ of $\mathcal{P}_{r_j}$, we append $\mathcal{P}_{r_x}$, 1 and $|\mathcal{P}_{r_x}|$ at the
end of Lead$_i$, F$_i$ and Con$_i$, respectively, and return. Otherwise, if no such guide link exists,
we call LeadVisit($\mathcal{P}_{r_j}$).

Second: For each agglomerate $\mathcal{A}_i$ in undecided_g, we tag its columns so that we are able to either
(a) classify $\mathcal{A}_i$ as unsolved or exhausted or (b) partition $\mathcal{A}_i$ into some smaller unsolved or exhausted
agglomerates. Basically, a column $C$ of $\mathcal{A}_i$ has the tag $l$, $1 \le l < |$Lead$_i| + 1$, if the subtree rooted at
the node $\mathcal{P}_j$ in position $l$ in Lead$_i$ contains the contact node with which $C$ is associated. Otherwise,
if no such node exists in Lead$_i$, $C$ has the tag $|$Lead$_i| + 1$. Specifically, we compute the (inclusive)
prefix sum SCon$_i$ of Con$_i$ and set SCon$_i[0] = 0$. We scan the columns of $\mathcal{A}_i$ in contact visiting order
(i.e. by using the preorder of the contact nodes in $\mathcal{A}_i$). Let us consider the $j$-th one of them and
let $l$, $1 \le l \le |$SCon$_i|$, be the index such that SCon$_i[l - 1] < j \le$ SCon$_i[l]$. The $j$-th column is tagged
with $l$ ($|$Lead$_i| + 1$) if F$_i[l] = 1$ ($= 0$).

Third: For each agglomerate $\mathcal{A}_i$ in undecided_g, we perform the slicing with the above tags. Let
$t(i) = |$Lead$_i|$. If the tags of $\mathcal{A}_i$ are all equal to some $l$ s.t. $1 \le l \le t(i)$ (to $t(i) + 1$), we move
$\mathcal{A}_i$ to unsolved_g (to exhausted_g) and the step ends. Otherwise, we call Slice($\mathcal{A}_i$) obtaining
$\mathcal{A}_{i_1}, \ldots, \mathcal{A}_{i_{t(i)}}$ and, possibly, $\mathcal{A}_{i_{t(i)+1}}$, where $\mathcal{A}_{i_l}$ is the agglomerate whose columns have $l$ as tag, for
each $1 \le l \le t(i) + 1$. The node in position $l$ of Lead$_i$ is the leading subproblem of $\mathcal{A}_{i_l}$, for each
$1 \le l \le t(i)$. Finally, we move $\mathcal{A}_{i_{t(i)+1}}$ to exhausted_g, and all the other $\mathcal{A}_{i_l}$'s to unsolved_g.

To understand why the above computation is correct, let us describe some properties of the
tagging done. For any suffix $T_j$, let us denote with $\tau(T_j)$ the tag of the column $(c, r)$ such that
$c \le j \le r$.

If a subproblem $\mathcal{P}_r$ of $\mathcal{A}_i$ is unsolved then the following holds: (a) $\tau(T_{j'}) = \tau(T_{j''})$, for any two
$T_{j'}, T_{j''} \in \mathcal{P}_r$ and (b) $1 \le \tau(T_{j'}) \le t(i)$.

On the other hand, if for a subproblem $\mathcal{P}_r$ of $\mathcal{A}_i$ we have that conditions (a) and (b) hold, then
on the path from $\mathcal{P}_r$ to the root there has to be at least one unsolved subproblem whose suffixes
have the same tag $\tau(\cdot)$ as $\mathcal{P}_r$'s (maybe only $\mathcal{P}_r$ itself, if it is unsolved). Also, the highest unsolved
subproblem on said path must be the one in position $\tau(T_{j'})$ in Lead$_i$.

Given the definition of Slice in Section 3.3, the correctness follows.

5.3 Slicing agglomerates 

As we have seen in Section 3.3, the slice operation receives in input an agglomerate $\mathcal{A}$ whose columns
have been tagged with integers in $\{1, \ldots, d\}$, where $d \le \|\mathcal{A}\|$. Note that there should be at least
two columns with different tags, since otherwise we do not need to run the slicing.

During the slice operation we deal with instances of the following grouping problem: We are
given a list $L$ of objects, each with an integer tag in $\{1, \ldots, d, d + 1\}$ (where the tags of the columns
are $d$ but during parts of the slicing we will need an extra tag for special purposes). We want to
partition $L$ into $d' \le d + 1$ lists $L_{i_1}, \ldots, L_{i_{d'}}$ such that an object $o \in L_j$ iff $o$'s tag is $j$, for each
$j \in \{i_1, \ldots, i_{d'}\} \subseteq \{1, \ldots, d, d + 1\}$. Note that the order in which the lists $L_{i_1}, \ldots, L_{i_{d'}}$ are produced
does not necessarily have to follow the order of the tags. Also, the problem is easy if $|L| = \Omega(d)$
since it falls within the radix sort scheme. In our case, we can spend $O(d)$ preprocessing time
and space beforehand: after that, for each instance $L$ of the grouping problem, we assume that
$|L| = o(d)$, and we cannot pay $O(d)$ time but just $O(|L|)$ time.

The procedure Group solves the above problem as follows. Between one call and the other,
it reuses the same array $J$ of $d + 1$ slots that is allocated at the beginning of the slice operation
and is never reset from one call to another. Each slot $J[i]$ has two fields: $J[i].p$, a list pointer, and
$J[i].t$, an integer. Let $\eta$ be an integer timestamp unique for each call (since we can just maintain an
increasing integer throughout all the calls to Group). While we scan $L$, we build a list $L'$ of lists
and then return it. Let $c$ be the tag of the current object $o$. During the scan we have two cases:
(i) if $J[c].t \neq \eta$, we set $J[c].t = \eta$ and start a new list $L_c$ with $o$ as first element; also, we append
$L_c$ to $L'$ and set $J[c].p$ to point to $L_c$; (ii) if $J[c].t = \eta$, we append $o$ to the list pointed by $J[c].p$.

Lemma 5 After $O(d)$ preprocessing time and space, each call to the procedure Group requires
$O(|L|)$ time.
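
The timestamp technique behind Group can be sketched as follows. The class and method names are hypothetical, and objects are passed as (object, tag) pairs for simplicity; the two Python lists play the role of the fields J[i].t and J[i].p.

```python
class Grouper:
    """Groups tagged objects in time linear in the input, reusing one array."""

    def __init__(self, d):
        # One-time O(d) preprocessing; slot 0 is unused, tags range over 1..d+1.
        self.stamp = [0] * (d + 2)      # plays the role of J[i].t
        self.bucket = [None] * (d + 2)  # plays the role of J[i].p
        self.clock = 0                  # increasing timestamp, one value per call

    def group(self, tagged):
        """tagged -- iterable of (obj, tag); returns a list of (tag, [objs])."""
        self.clock += 1
        result = []
        for obj, tag in tagged:
            if self.stamp[tag] != self.clock:
                # first object with this tag in the current call: start a list
                self.stamp[tag] = self.clock
                self.bucket[tag] = [obj]
                result.append((tag, self.bucket[tag]))
            else:
                self.bucket[tag].append(obj)
        return result


g = Grouper(d=3)
print(g.group([("a", 2), ("b", 1), ("c", 2)]))
print(g.group([("x", 1)]))   # stale slots from the first call are ignored
```

The array is never cleared between calls: a slot whose stamp differs from the current timestamp is simply treated as empty, which is what keeps each call proportional to the length of its input.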

At this point, we can describe Slice($\mathcal{A}$), which has three main phases: pruning, slicing, and
finishing.

5.3.1 Pruning phase 

The goal is to identify some relevant nodes in the tree representing the agglomerate $\mathcal{A}$. Specifically,
a node $\mathcal{P}_i \in \mathcal{A}$ is homogeneous if all the columns associated with all the contact nodes in the subtree
rooted at $\mathcal{P}_i$ have the same tag. We want to find the set Pruned of the highest (i.e. closest to the
root) homogeneous nodes of $\mathcal{A}$ and tag each one of them with the common tag of their columns.
Let us recall (Lemma 1) that we need an implementation of Slice with a time complexity that is
of the order of the number of columns of $\mathcal{A}$ plus the total number of the new subproblems created.
Hence, we cannot touch all the homogeneous nodes of $\mathcal{A}$ since they will not be refined into new
subproblems at this stage. If we are able to find the set Pruned efficiently then, later on, each
subtree of $\mathcal{A}$ rooted at a node in Pruned will be just linked to the corresponding new agglomerate
without accessing any of its internal nodes. The pruning phase proceeds with the following steps.

First: We traverse the skip tree of $\mathcal{A}$ level by level from the bottom one.

• For each $\mathcal{P}_l$ at the lowest level we do the following. (i) Since $\mathcal{P}_l$ is a contact node (and a leaf),
we call Group on its (tagged) column list. (ii) Then we tag $\mathcal{P}_l$ with $c$ (resp., with $d + 1$) if
from Group we get only one list $L_c$ (resp., we get more than one list).

• For each $\mathcal{P}_i$ at a generic level (different from the lowest) we do the following. (i) If $\mathcal{P}_i$ is a leaf,
we perform the same steps as we did at the lowest level. (ii) Otherwise, if $\mathcal{P}_i$ is an internal
node, we call Group on the list of objects, where each object is one of $\mathcal{P}_i$'s children (and
their tags); if $\mathcal{P}_i$ is also a contact node, we add more objects to the list, where each object is
one of $\mathcal{P}_i$'s columns (which already have tags). (iii) Then we tag $\mathcal{P}_i$ as we did at the lowest
level.



Second: We traverse $\mathcal{A}$'s skip tree with the following recursive SkipVisit($\mathcal{P}_r$), starting from the
root. SkipVisit($\mathcal{P}_i$) is defined for a node $\mathcal{P}_i$ in the skip tree as follows: (i) if $\mathcal{P}_i$'s tag is $< d + 1$,
we output a pointer to $\mathcal{P}_i$ and return; (ii) otherwise, if the tag is $d + 1$ we call SkipVisit($\mathcal{P}_{i_j}$), for
each child (in the skip tree) $\mathcal{P}_{i_j}$ of $\mathcal{P}_i$.
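
The first two steps of the pruning phase can be illustrated with the following sketch. It is a simplification under stated assumptions: the skip tree is given as a hypothetical `children` dictionary, the column tags of each contact node as a `column_tags` dictionary, the bottom-up pass is written recursively instead of level by level, and a Python set replaces the calls to Group.

```python
def tag_skip_tree(root, children, column_tags, d):
    """Bottom-up tagging of the skip tree.

    A node gets tag c if all its own column tags and all its children's tags
    equal c, and the special tag d + 1 otherwise.
    """
    tag = {}

    def visit(v):
        seen = set(column_tags.get(v, []))
        for child in children.get(v, []):
            visit(child)
            seen.add(tag[child])
        tag[v] = seen.pop() if len(seen) == 1 else d + 1

    visit(root)
    return tag


def skip_visit(v, children, tag, d):
    """Return the highest nodes whose tag is smaller than d + 1."""
    if tag[v] < d + 1:
        return [v]
    found = []
    for child in children.get(v, []):
        found.extend(skip_visit(child, children, tag, d))
    return found


children = {0: [1, 2], 2: [3, 4]}
column_tags = {1: [1, 1], 3: [2], 4: [2]}
tag = tag_skip_tree(0, children, column_tags, d=2)
print(skip_visit(0, children, tag, d=2))
```

In this example node 2 is reported instead of its children 3 and 4, since every column below it carries the same tag.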

Third: We are ready to retrieve and mark the nodes to be added to Pruned. (Note that the root
$\mathcal{P}_r$ cannot belong to Pruned.) For each node $\mathcal{P}_i$ outputted in the previous step, we proceed as
follows:

(i) We retrieve $\mathcal{P}_i$'s parent in the skip tree, $\mathcal{P}_j$.

(ii) If $\mathcal{P}_i$ is the $x$-th child of $\mathcal{P}_j$ in the skip tree of $\mathcal{A}$, we find $\mathcal{P}_j$'s $x$-th child in the tree of $\mathcal{A}$, say
$\mathcal{P}_{j_x}$: note that $\mathcal{P}_i$ is a descendant of $\mathcal{P}_{j_x}$ in the tree of $\mathcal{A}$ (they could even be the same node
in some cases). Then we mark $\mathcal{P}_{j_x}$ (since we add it to Pruned) and tag it with $\mathcal{P}_i$'s tag.

(iii) If $\mathcal{P}_i$ and $\mathcal{P}_{j_x}$ are not the same node, we do the following. (a) We create a temporary skip
link between them (for the next phase, the slicing phase). (b) Let $v$ be the ancestor node
of $\mathcal{P}_i$ pointed by the corresponding guide link of $\mathcal{P}_j$ (if any). For the sake of clarity, observe
that traversing the tree of $\mathcal{A}$ from its root to $\mathcal{P}_i$, we meet $\mathcal{P}_j$, $\mathcal{P}_{j_x}$, $v$, and $\mathcal{P}_i$. We create a
temporary guide link between $\mathcal{P}_{j_x}$ and $v$ (they may be the same node in some cases).



Fourth: Let $\mathcal{Y}$ be the set of nodes of $\mathcal{A}$ that are not in a subtree rooted at a node belonging to
Pruned. We assign to each $\mathcal{P}_i \in \mathcal{Y}$ a fingerprint that is a unique integer from $\{1, \ldots, |\mathcal{Y}|\}$ (any
choice would do, for example the DFS numbering).



Lemma 6 The pruning phase requires $O(\|\mathcal{A}\| + |\bigcup_{i=1}^{d} \mathcal{A}_i| - |\mathcal{A}|)$ time.



5.3.2 Slicing phase 

The goal is to actually slice the agglomerate $\mathcal{A}$ into the agglomerates defined by the columns' tags,
as claimed in Lemma 1, with some provisions. Indeed, the only missing things to complete this
task are the following: (a) the link between each contact node and the root; (b) the contact node
list; (c) the correct labels and the correct ordering in SubList for the new subproblems created;
(d) the guide links of skip nodes that are not descendants of any node in Pruned. We will deal with
these things in the next phase, the finishing phase.

We invoke the recursive procedure SliceRec on the root of A. A generic call SliceRec($V_r$), where $V_r$ now indicates a generic node in A, has three cases.

$V_r$ is a leaf. First, we call Group on $V_r$'s column list and obtain $d' \le d+1$ lists $L_{r_1}, \ldots, L_{r_{d'}}$. By scanning each $L_{r_i}$ we obtain (a) all the distinct tags $t_1, \ldots, t_{d'}$ and (b) $n_1, \ldots, n_{d'}$, where $n_i$ is the number of columns in list $L_{r_i}$.

Second, we create $d'$ new subproblems $V_{r_1}, \ldots, V_{r_{d'}}$ and insert them into SubList in place of $V_r$. If $V_r$ was A's leading subproblem, we set $V_{r_1}, \ldots, V_{r_{d'}}$ to be leading subproblems of their respective agglomerates (they might not actually be, see Section HT2]). For each $1 \le i \le d'$: (a) we set $V_{r_i}$'s tag and fingerprint to $t_i$ and $V_r$'s fingerprint, respectively; (b) we set $|V_{r_i}|$, $V_{r_i}$'s column list pointer and $\ell_{r_i}$ to $n_i$, $L_{r_i}$ and $\ell_r$, respectively. Finally, we eliminate $V_r$ and return $V_{r_1}, V_{r_2}, \ldots, V_{r_{d'}}$.

$V_r$ is a contact node but not a leaf. If $V_r \in$ Pruned, we return it immediately (it has been tagged in the previous phase, the pruning phase). Otherwise we proceed as follows.

First, we call SliceRec($V_{r_i}$) for each child $V_{r_i}$ of $V_r$. From each call SliceRec($V_{r_i}$) we receive a set $Q_i$ of root nodes. We call Group on the list with the objects in $C \cup Q_1 \cup \cdots \cup Q_x$, where C contains $V_r$'s columns. This produces $d' \le d+1$ lists $L_{r_1}, \ldots, L_{r_{d'}}$. By scanning the lists, we obtain (a) all the distinct tags $t_1, \ldots, t_{d'}$ and (b) pairs $(col_1, nod_1), \ldots, (col_{d'}, nod_{d'})$, where lists $col_i$ and $nod_i$ contain all the columns and all the nodes in $L_{r_i}$, respectively (some of them may be empty). Then by scanning each $col_i$ and $nod_i$ we obtain (c) $n_1, \ldots, n_{d'}$, where $n_i$ is the number of columns in $col_i$, and (d) $p_1, \ldots, p_{d'}$, where $p_i$ is the total number of suffixes in the subproblems in list $nod_i$.

Second, we create $d'$ new subproblems $V_{r_1}, \ldots, V_{r_{d'}}$ and insert them in SubList in place of $V_r$. If $V_r$ was A's leading subproblem, we set $V_{r_1}, \ldots, V_{r_{d'}}$ to be leading subproblems of their respective agglomerates (same as the above case when $V_r$ is a leaf). For each $1 \le i \le d'$: (a) we set $V_{r_i}$'s tag and fingerprint to $t_i$ and $V_r$'s fingerprint, respectively; (b) we set $|V_{r_i}|$, $V_{r_i}$'s column list pointer (in case $V_{r_i}$ is a new contact node) and $\ell_{r_i}$ to $n_i + p_i$, $col_i$ and $\ell_r$, respectively; (c) we make the nodes in $nod_i$ be $V_{r_i}$'s children.

Third, for each $V_{r_i}$ and each child $V_{r_{ij}}$ of $V_{r_i}$ we do as follows.

(i) If $V_{r_i}$ is not a skip node but its only child $V_{r_{ij}}$ is, we create a temporary link between them.

(ii) If neither $V_{r_i}$ nor $V_{r_{ij}}$ is a skip node, we redirect to $V_{r_i}$ the (only) temporary link that goes into $V_{r_{ij}}$.

(iii) If $V_{r_i}$ is a skip node and $V_{r_{ij}}$ is not, we redirect to $V_{r_i}$ the temporary link that goes into $V_{r_{ij}}$, and we change it into a skip link.

(iv) If both $V_{r_i}$ and $V_{r_{ij}}$ are skip nodes, we create a skip link between them.

After that, we eliminate $V_r$ and return $V_{r_1}, V_{r_2}, \ldots, V_{r_{d'}}$.

Vr is not a contact node. This case is analogous to the previous one, only simpler, because V r 
is not a contact node and no new contact nodes can be created from it. 



Lemma 7 The slicing phase requires $O\left(\|A\| + \left|\bigcup_i A_i - A\right|\right)$ time.

5.3.3 Finishing phase



The finishing phase adds the information left missing by the previous phase, the slicing phase. It proceeds with the following steps.

First: We sort all the new subproblems $V_j$ according to the keys $(m_j, t_j)$, where $m_j$ and $t_j$ are $V_j$'s fingerprint and tag, respectively. Since each $(m_j, t_j)$ has $O(\log |A|)$ bits, we can use radix sort.

Second: After the sorting, for each old active subproblem $V_i$, the new subproblems $V_{i_1}, \ldots, V_{i_x}$ are grouped together and in their correct relative order. Hence, we can reattach them in SubList in the correct order and also set the correct label $\ell_{i_j}$ for each one of them. If $V_i$ was inactive, we leave $V_{i_1}, \ldots, V_{i_x}$ and their labels as they are.
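
The sort of the first step can be illustrated with a two-pass least-significant-digit radix sort. The sketch below assumes each new subproblem is represented as a tuple `(fingerprint, tag, payload)` with small non-negative integer keys; this representation and the function names are assumptions made for the example, not the paper's data structures.

```python
def counting_sort_by(items, key, max_key):
    """Stable counting sort of `items` by an integer key in the range 0..max_key."""
    buckets = [[] for _ in range(max_key + 1)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def sort_new_subproblems(subproblems, max_fingerprint, max_tag):
    """LSD radix sort on (fingerprint, tag): sort by tag first, then stably by fingerprint."""
    by_tag = counting_sort_by(subproblems, lambda s: s[1], max_tag)
    return counting_sort_by(by_tag, lambda s: s[0], max_fingerprint)
```

After the sort, subproblems that share a fingerprint (that is, those originating from the same old subproblem) are contiguous and ordered by tag, which is the grouping the second step relies on.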



Third: For each new agglomerate $A_i$ we build its contact node list by visiting its skip tree. Then we scan its contact node list and for each contact node we set the link to $A_i$'s root.



Fourth: For each new agglomerate $A_i$ we need to create the guide links for the skip nodes of $A_i$ that are not descendants of the nodes in Pruned. To that end, we call GuideVisit($V_i$), defined as follows, on $A_i$'s root.

(i) For each child $V_{i_j}$ of $V_i$ in A's skip tree, we scan the nodes $V_{i_x}$ of A's tree that are both descendants of $V_i$ and ancestors of $V_{i_j}$, starting from the highest one. We keep scanning them until we find a $V_{i_x}$ that falls into one of three cases: (a) it is unsolved, (b) it is in Pruned or (c) it is $V_{i_j}$. In case (a) we create a guide link between $V_i$ and $V_{i_x}$. In case (b), if $V_{i_x}$ has a temporary guide link (possibly created in the pruning phase) to a node v, we create a guide link between $V_i$ and v. Otherwise no guide link is created.

(ii) We call GuideVisit($V_{i_j}$) for each child $V_{i_j}$ of $V_i$ in A's skip tree.

Lemma 8 The finishing phase requires $O\left(\|A\| + \left|\bigcup_i A_i - A\right|\right)$ time.



5.4 Joining agglomerates 

Given an agglomerate A' that is joinable to A, we need to attach the root of A' to a suitable place in A. Let $V_x \in A$ be the contact subproblem such that $T_{j+1} \in V_x$ for each suffix $T_j$ of the root subproblem $V_{r'}$ of A'. The operation Join(A', A) (see Section 3.3) proceeds with the following steps.



First: We fuse the trees of A and A' by making $V_{r'}$ the new leftmost child of $V_x$.

Second: Let L and L' be the contact node lists of A and A', respectively. Let p and s be the predecessor and successor of $V_x$ in L, respectively. Let b and e be the leftmost and the rightmost node in L', respectively. If after the fusion $V_x$ is not (is still) a contact node, we link p ($V_x$) to b and e to s.



Third: Let us define v as follows: (i) if $V_x$ is still a skip node after the fusion (it may not be a contact node anymore), then v is $V_x$; (ii) otherwise v is the ancestor of $V_x$ pointed to by its skip link. If $V_{r'}$ is a contact or branching node, we create a skip link between v and $V_{r'}$. Otherwise $V_{r'}$ is not a skip node of the final agglomerate. Hence, the only skip link from a node in A' to $V_{r'}$ is redirected to v. We do the same with the only guide link of $V_{r'}$.

Fourth: Each column $(c'_i, r'_i)$ of A' is changed into $(c'_i, r_j)$, where $(c_j, r_j)$ is the column of A associated with $V_x$ such that $r'_i + 1 = c_j$. The column $(c_j, r_j)$ is deleted since $T_{c_j}$ is not a contact suffix anymore.
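
The column update of the fourth step amounts to gluing each column of A' onto the column of A that starts right after it. The following sketch assumes columns are plain `(c, r)` integer pairs and uses a dictionary to find the matching column; both choices, and the function name, are illustrative assumptions rather than the paper's implementation.

```python
def fuse_columns(cols_a_prime, cols_vx):
    """Fuse each column (c', r') of A' with the column (c, r) of V_x having c == r' + 1.

    Returns the fused columns and the columns of V_x that were left unmatched.
    """
    by_start = {c: r for (c, r) in cols_vx}  # index the columns of V_x by their start c
    fused = []
    for (c_prime, r_prime) in cols_a_prime:
        r = by_start.pop(r_prime + 1)  # the matched column of A is deleted
        fused.append((c_prime, r))
    remaining = sorted(by_start.items())
    return fused, remaining
```

For instance, fusing the column (3, 5) of A' with the column (6, 9) of A yields the single column (3, 9).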

Fifth: We set the root pointer of each contact node of A' to A's root. 

Since the total number of skip links, columns and contact nodes of A' and deleted pairs of A is $O(\|A'\|)$, Lemma [2] is proven.

5.5 Slicing and joining with an exhausted agglomerate

Let us now describe the SliceJoin(A, A*) operation used in the last step of RAP and whose effect has been described in Section 3.3. Let A and A* be unsolved and exhausted, respectively, and let us assume that A is joinable with A*. Let $V_x$ be the contact node of A* such that $T_{j+1} \in V_x$ for each suffix $T_j$ of the root subproblem $V_r$ of A.

For the slicing part of SliceJoin, we do not explicitly tag A*'s columns but, conceptually, we would have the following: (i) only two tags (1 and 2) for the columns; (ii) $V_x$ is the only contact subproblem of A* with some columns with tag 1, and all the other columns have tag 2. Because of that, we know that the only subproblems of A* that are partitioned during the slicing part of SliceJoin are $V_x$ and its ancestors. Thus, we do not need the pruning phase of Slice because $V_x$ is the only non-homogeneous node.

The slicing phase is the same as in Slice except for two things. First, the recursion does not touch any node that is not on the path from the root of A* to $V_x$ (since they are implicitly in Pruned). Second, when we treat $V_x$ (which is a contact node) we do not touch its column list at all: we just create the two subproblems $V_{x_1}$ and $V_{x_2}$, and we link the whole column list of $V_x$ to $V_{x_2}$ (which will be part of $A^*_2$ and is still a contact node). We can leave in the column list of $V_{x_2}$ all those columns that, after a normal Slice of A*, would end up being associated with $V_{x_1}$ without incurring any trouble, for two reasons: (i) after the join of A and $A^*_1$ they would disappear anyway, and (ii) $A^*_2$ is exhausted and its contact nodes are not active anymore.

We also do not need the finishing phase of Slice, since all the subproblems in A* are exhausted and A will join with $A^*_1$ after the slicing part of SliceJoin.

Because of the characteristics of $A^*_1$, to join A with it we just (i) link A's root to the only contact node of $A^*_1$ and (ii) change each column $(c_j, r_j)$ of A to $(c_j, r_j + l)$, where $l$ is the number of nodes of $A^*_1$ (whose tree is just a path).

Thus, Lemma [3] follows as a corollary of Lemma [1].

5.6 Finalization stage 

We finally have to store into SubList all the K wanted suffixes. Note that we need to retrieve them from the columns that contain them. To this end, we output the set of K pairs RkSuff = $\{(r_i, j) \mid r_i \in \mathcal{R}$ and $T_j$ has rank $r_i\}$ in the following way. We scan SubList and for each subproblem $V_x$ that is both solved and a leaf we do as follows. First we add $(r', j_x)$ to RkSuff, where $r'$ and $T_{j_x}$ are $V_x$'s only rank and only suffix. Then we retrieve all the ancestors of $V_x$ (in the tree of its agglomerate) that are also solved. They are all the nodes closest to $V_x$ in its leaf-to-root path. Let $V_{x_y}$ be the y-th closest one of them; we add $(r_y, j_x + y)$ to RkSuff, where $r_y$ is the only rank of $V_{x_y}$ (and $T_{j_x + y}$ is clearly $V_{x_y}$'s only suffix).
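
The scan described above can be sketched as follows. The `Sub` record and its field names are hypothetical and serve only to make the example self-contained; in particular, the sketch assumes that only solved leaves store the starting position of their suffix explicitly.

```python
class Sub:
    """Hypothetical subproblem record (field names are illustrative, not from the paper)."""
    def __init__(self, parent=None, solved=False, rank=None, suffix=None, is_leaf=False):
        self.parent = parent
        self.solved = solved
        self.rank = rank        # the only rank, when the subproblem is solved
        self.suffix = suffix    # starting position j of the only suffix (stored for leaves)
        self.is_leaf = is_leaf


def collect_rank_suffix_pairs(sublist):
    """Return the (rank, suffix position) pairs found from solved leaves and their solved ancestors."""
    pairs = []
    for sub in sublist:
        if not (sub.solved and sub.is_leaf):
            continue
        pairs.append((sub.rank, sub.suffix))
        y, ancestor = 1, sub.parent
        # the y-th closest solved ancestor holds the suffix starting y positions later
        while ancestor is not None and ancestor.solved:
            pairs.append((ancestor.rank, sub.suffix + y))
            y += 1
            ancestor = ancestor.parent
    return pairs
```

Each node is touched a constant number of times, because a solved node contains a single suffix and therefore lies on the leaf-to-root path of at most one solved leaf.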

Lemma 9 The finalization stage requires O(N) time.

6 Correctness and Analysis 

Correctness and complexity are strictly related, so we discuss them together. Here we give the lemmas needed to prove the theorems stated in the Introduction, and a few proofs. We devote Section [7] to the remaining proofs.

Consider first a simplified scenario where we have to perform multi-selection on a prefix-free 
set Z of N independent strings of total length L, using our set 7Z of K ranks. We adopt the same 
notation as in formula (UJ and Section 

We run MselMset(Z, $\mathcal{R}$) on the first symbol of all the strings in Z. This partitions the strings into unsolved, solved and exhausted subproblems: $V_i$ is unsolved when it contains all the strings with the same first symbol and $|V_i| > |\mathcal{R}(V_i)| \ge 1$; $V_i$ is solved when $|V_i| = |\mathcal{R}(V_i)| = 1$; finally, if $V_i$ is exhausted then $|\mathcal{R}(V_i)| = 0$ and the first symbols of its strings may not be the same. We repeat the refining steps until there are no more unsolved subproblems. We pick any unsolved subproblem $V_i$. Let $d_i$ be the length of the common prefix (of its strings) examined so far: we invoke MselMset($V_i$, $\mathcal{R}(V_i)$) using the alphabetic order on the symbols $y[d_i + 1]$ for $y \in V_i$. Thus we refine $V_i$ into smaller subproblems, classify them as described above, and repeat the steps.

Lemma 10 The running time of the multi-selection algorithm with K ranks for a prefix-free set of N independent strings of total length L is upper bounded by $O\left(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N + L\right)$.

Here we focus on the multi-selection for our set S of N suffixes, and show how to consider them as a set of N independent virtual strings.

Lemma 11 The running time of the suffix multi-selection algorithm for a text of length N is upper bounded by $O\left(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N + \mathrm{rap}\right)$, where rap is the total time required by all the RAPs minus the time for the MselMset calls.

Proof: Consider the computation described in Sections [3] and [U and the virtual symbol cost of the calls to MselMset: when applied to a subproblem $V_w$, it performs comparisons using a certain order on $V_w$'s keys, which can be seen as the virtual symbols of independent strings. Indeed, these virtual symbols are exclusively created and "used" for $V_w$. Unlike T's symbols, virtual symbols are not shared by subproblems. Each $V_w$ has associated $|V_w|$ virtual strings that are made up of all the virtual symbols created to refine $V_w$ during several RAPs, every time $V_w$ is the leading subproblem of its current agglomerate. And, unlike the suffixes of T, all the virtual strings are independent.

Let L be the total number of virtual symbols thus created by all the RAPs. Lemma [10] reports the virtual symbol cost for all the MselMset calls, that is, their total contribution to the final cost of our multi-selection algorithm. We have to add the cost of the rest of the computation, which is O(rap) by definition. It remains to show that L = O(rap), thus proving the claimed bound.

When MselMset is applied to $V_w$, let A be the agglomerate for which $V_w$ is its leading subproblem. The number $|V_w|$ of virtual symbols created in this call satisfies $|V_w| = \|A\|$. Since the RAP computation time for this step (minus the call to MselMset) is $O(\|A\|)$ (e.g. Lemmas [1]-[3]), this computation time is an upper bound for $|V_w|$. Summing up over all the RAPs, we obtain that the total number L of virtual symbols thus created is upper bounded by the computation time of all the RAPs minus the MselMset calls, namely, L = O(rap). ∎

Lemma 12 The running time of the suffix multi-selection algorithm for a text of length N is upper bounded by $O(K \log K + N + \mathrm{rap})$ when $\mathcal{R}$ is an interval of K consecutive ranks.

We need an intermediate stage to find the suffixes of ranks $r_1$ and $r_K$, and create a subproblem with the remaining $K - 2$ ones. After that, we run our multi-selection.

Lemma 13 Independently of the choice of the ranks in $\mathcal{R}$, rap = O(N).

Proof: We give an analysis based on counting the following types of events: (a) Subproblem creation: when some $V_i$ is partitioned into $V_{i_1}, \ldots, V_{i_p}$ (during a Slice or at the beginning of a SliceJoin). (b) Suffix discovery: when some $T_w \in V_w$ is recognized as one of the wanted suffixes with rank in $\mathcal{R}$ (third step of RAP). (c) Suffix exhaustion: when some suffix $T_e \in V_w \in A$ becomes exhausted, where $V_w$ is the leading subproblem of A (third step of RAP). (d) Column fusion: when a column of A' is fused with one of A (during Join(A', A) or at the end of a SliceJoin). (e) Inner collision: when processing a column $(c_i, r_i)$ of A such that $(c_j, r_j)$ is also in A and satisfies $c_j = r_i + 1$, while $T_{c_i}$ and $T_{c_j}$ do not belong to the same subproblem (second step of RAP).

Claim: There are overall O(N) events occurring in any execution of our algorithms. Indeed, the same event cannot repeat. A column is never divided and a subproblem is never merged with others. If a suffix is exhausted or a wanted suffix is found, it can never be in an unsolved subproblem again. An inner collision is unique, since the two colliding columns will not be part of the same agglomerate. Since we start with N columns and one subproblem, the total number of events is O(N).

For a generic RAP on an agglomerate A, let Na be the number of events thus occurring. Since the 
overall number of events is O(N), this implies that YIa ^A = 0(N). We show that Na > \\A\\ +IL4, 
where Ha is the number of created subproblems. Consider events (a), and observe that their contri- 
bution is IF4, which is the sum of three quantities: | uf =1 Ai — A\, where A\, . . . , A^ are the agglom- 
erates into which A is sliced in the fourth step; X]{A i eundecided_g} I U jLi^ ~~ ^»|> where A^ , . . . , Ai d , 
are the agglomerates into which each A{ is sliced in the fifth step; X^{Ai6siicejoinabie g} 

|-Ai*i|j where 

slice joinable_g denotes all the Aj's in joinable^g that need a SliceJoin in the sixth step. 

As for events (b)-(e), they total at least $\|A\|$ in number and occur in the fourth step. Namely, the number of (b)s and (c)s is at least the total number of columns of the $A_i$'s that go to exhausted_g or to undecided_g: each column of each $A_i$ moved to undecided_g corresponds to (c), and each column of each $A_i$ moved to exhausted_g corresponds to either (b) or (c). The number of (d)s and (e)s is at least the total number of columns of the $A_i$'s that are moved to joinable_g and unsolved_g, respectively. Since the total number of involved columns in the fourth step is $\sum_i \|A_i\| = \|A\|$, we obtain the claimed number.

At this point, to prove rap = O(N), it remains to see that the cost of a generic rap (MselMset 
excluded) on an agglomerate A is 0(||A|| +ILa) time. The first three steps of the RAP take 0(||A||) 
time. The costs of the fourth and fifth steps are given by Lemma[TJ precisely, 0(||j4|| + | uf =1 Aj — A\j 
and 0(I]{ J 4 1 eundecided.g} (11^*11 + I u j=i A ij ~ A i\))- B y Lemmas [2] and O the total cost of the 

Sixth Step is 0(Z]{A i ejoinable_g}ll^ll + Z]{AiGslicejoinable.g} 1^**1 1)' Since Z]{Aieundecided.g}ll^ll + 

E{A sej oinabieg}ll A J < Ul the total cost is 0(\\A\\ +U A ) and rap = OC£ A (\\A\\ + U A )) = 
0(E A Na) = 0(N). I 

For any two subproblems $V_i$ and $V_{i'}$ such that $M_i \ne M_{i'}$, let lcp($V_i$, $V_{i'}$) be the length of the longest common prefix of any two suffixes $T_j \in V_i$ and $T_{j'} \in V_{i'}$. We have the following:

Lemma 14 The suffix multi-selection algorithm can support lcp($V_i$, $V_{i'}$) queries in O(1) time, for any two $V_i$, $V_{i'}$ such that $M_i \ne M_{i'}$, without changing its asymptotic time complexity.

Proof: Consider a snapshot of the computation, recalling that we maintain the sorted linked list SubList of all the subproblems (where some of them belong to the same neighborhood). Suppose by induction that we have the lcp's between consecutive subproblems. We use a Cartesian Tree (CT) plus dynamic lowest common ancestor (LCA) queries to compute the lcp for any two subproblems. We maintain the induction at the end of the Slice operation: when $A_i$ is sliced into $A_{i_1}, \ldots, A_{i_{d'}}$, the newly created subproblems $V_{x_1}, \ldots, V_{x_{d'}}$ for each $A_{i_j}$ are stored contiguously in SubList. We compute lcp($V_{x_1}$, $V_{x_2}$), ..., lcp($V_{x_{d'-1}}$, $V_{x_{d'}}$) using their keys, the CT, and the LCA queries, in O(d') time, which we can pay for. Since these lcp's are longer, we just need to add $d' - 1$ new leaves to CT, without increasing the asymptotic complexity. ∎
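
The reason a Cartesian Tree with LCA queries suffices is that, in a lexicographically sorted list, the lcp of two items equals the minimum of the lcp's between the consecutive items lying between them. The sketch below illustrates this reduction with a static sparse table; this is a simpler stand-in chosen for the example, not the dynamic CT and LCA structure used in the proof, and the function names are assumptions.

```python
def build_sparse_table(adj_lcp):
    """Build table[k][i] = min(adj_lcp[i : i + 2**k]) for a static list of adjacent lcp's."""
    table = [list(adj_lcp)]
    k = 1
    while (1 << k) <= len(adj_lcp):
        prev = table[-1]
        half = 1 << (k - 1)
        width = len(adj_lcp) - (1 << k) + 1
        table.append([min(prev[i], prev[i + half]) for i in range(width)])
        k += 1
    return table


def lcp_query(table, i, j):
    """lcp between the i-th and j-th sorted items (i < j): the minimum of adj_lcp[i..j-1]."""
    length = j - i
    k = length.bit_length() - 1
    return min(table[k][i], table[k][j - (1 << k)])
```

Here `adj_lcp[i]` is the lcp between the i-th and the (i+1)-th item of the sorted list, and each query takes constant time after the table is built.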

7 Proofs 

7.1 Multi-selection on a set of strings 

In order to prove the upper bound for the time complexity of our suffix multi-selection algorithm, we first need to prove the complexity of the following algorithm for multi-selection of independent strings (i.e. not sharing symbols).

We have a set Z of N independent strings of total length L such that none of them is the prefix of another. We want to select the K strings with ranks in $\mathcal{R} = \{r_1, \ldots, r_K\}$, where $r_1 < \cdots < r_K$. Let $r_0$ and $r_{K+1}$ be 0 and N + 1, respectively.

In our multi-selection algorithm for independent strings a subproblem $V_i$ is a subset of Z associated with (a) $\mathcal{R}(V_i) \subseteq \mathcal{R}$, (b) $less_i$, the number of strings in Z that are less than each one in $V_i$ and (c) an offset $off_i \in \{1, \ldots, l_i\}$, where $l_i$ is the length of the shortest string in $V_i$.

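1. We put in unsolved_g the first subproblem $V_0 = Z$, where $\mathcal{R}(V_0) = \mathcal{R}$, $less_0 = 0$ and $off_0 = 1$.

2. We repeat the following steps until unsolved_g is empty.

(a) We pick a subproblem $V_i$ in unsolved_g and we call MselMset($\mathcal{M}$, $\mathcal{R}'$), where $\mathcal{M} = \{ s[off_i] \mid s \in V_i \}$ and $\mathcal{R}' = \{ r_j - less_i \mid r_j \in \mathcal{R}(V_i) \}$. This partitions $\mathcal{M}$ into its pivotal multisets $\mathcal{M}_0, \mathcal{F}_1, \mathcal{M}_1, \ldots, \mathcal{F}_t, \mathcal{M}_t$.

(b) For each $\mathcal{F}_j$, we collect the set $V_{i_j} \subseteq V_i$ of the strings corresponding to the symbols in $\mathcal{F}_j$, and we set $off_{i_j} = off_i + 1$, $less_{i_j} = less_i + |\mathcal{M}_0| + \sum_{x=1}^{j-1} |\mathcal{F}_x \cup \mathcal{M}_x|$ and $\mathcal{R}(V_{i_j}) = \{ r_y \in \mathcal{R}(V_i) \mid less_{i_j} < r_y \le less_{i_j} + |V_{i_j}| \}$.

(c) If $|V_{i_j}| = 1$ we move $V_{i_j}$ to sol_g. Otherwise, we move $V_{i_j}$ to unsolved_g.

3. For each $V_i$ in sol_g, we output $(s_i, r_i)$, where $s_i \in V_i$ and $r_i \in \mathcal{R}(V_i)$.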
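
The overall control flow of the steps above can be sketched in a few lines. This is a simplified illustration only: it partitions each subproblem by bucketing and sorting the distinct symbols at the current offset, which is a plain stand-in for MselMset and does not achieve the bound of the lemma that follows. Offsets are 0-based here, and the function name is an assumption.

```python
from collections import defaultdict


def multiselect_strings(strings, ranks):
    """Return {rank: string} for 1-based ranks over a prefix-free set of distinct strings."""
    result = {}
    # each unsolved subproblem: (strings, wanted ranks, number of smaller strings, offset)
    unsolved = [(list(strings), sorted(ranks), 0, 0)]
    while unsolved:
        group, wanted, less, off = unsolved.pop()
        buckets = defaultdict(list)
        for s in group:
            buckets[s[off]].append(s)  # group the strings by their symbol at the offset
        for symbol in sorted(buckets):
            part = buckets[symbol]
            part_ranks = [r for r in wanted if less < r <= less + len(part)]
            if part_ranks:
                if len(part) == 1:
                    result[part_ranks[0]] = part[0]  # solved subproblem
                else:
                    unsolved.append((part, part_ranks, less, off + 1))
            # a part with no wanted rank is exhausted and is never refined again
            less += len(part)
    return result
```

Because the set is prefix-free, every string in a subproblem with at least two strings is long enough to be probed at the current offset, so the indexing never runs past the end of a string.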

Lemma 15 The running time of the multi-selection algorithm for a set Z of independent (and prefix-free) strings of total length L is upper bounded by

$c \left( N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N + 5L' \right)$,

where c is a suitable integer constant, $r_0 = 0$, $r_{K+1} = N + 1$, $\Delta_j = r_{j+1} - r_j$ for $0 \le j \le K$, and $L' \le L$ is the smallest number of symbols we need to probe to find the wanted strings.
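
To get a feel for how the comparison part of this bound interpolates between sorting and single selection, the following sketch evaluates $N \log N - \sum_j \Delta_j \log \Delta_j + N$ for a given set of ranks. It assumes base-2 logarithms and deliberately leaves out the constant c and the $5L'$ term; the function name is hypothetical.

```python
import math


def multiselect_bound(n, ranks):
    """Evaluate N log N - sum_j Delta_j log Delta_j + N for distinct 1-based ranks (log base 2)."""
    extended = [0] + sorted(ranks) + [n + 1]  # r_0 = 0 and r_{K+1} = N + 1
    deltas = [extended[j + 1] - extended[j] for j in range(len(extended) - 1)]
    return n * math.log2(n) - sum(d * math.log2(d) for d in deltas) + n
```

When all N ranks are requested every $\Delta_j$ equals 1 and the expression reduces to $N \log N + N$, whereas a single rank leaves two large gaps whose $\Delta_j \log \Delta_j$ terms cancel most of the $N \log N$ term.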

Proof: The offset size of a subproblem $V_i$ is the total length of the strings in $\{ s[off_i \ldots |s|] \mid s \in V_i \}$. It is easy to see that the computations on distinct subproblems proceed independently from one another. Hence, we prove the thesis by induction on the offset size of subproblems.

Let us consider subproblem Vq (step [Q) , its corresponding multiset M. and the pivotal multisets 
of M computed in step [21 Because of Lemma HI after the first execution of step [2a| the terms 
cN log N and cN of the wanted upper bound are accounted for. 

Let us first deal with the pivotal multisets of Ai such that = 1. They correspond to 
solved subproblems and they are not involved in step [2] any longer. Let us consider all the maximal 
consecutive groups of them. Let one such group be T% , .Fi 9 +i, • • • ,Fi g+z , for some Ti g . We know 
that J~i g is associated with just one rank n , , g' > g and the only string in the corresponding 
subproblem is the one of rank r« g , in Z. Analogously, Fi g+ \ is associated with just one rank 
and so forth up to Ti g+z -i included. Hence, for any such Ti g+W , we have that LFj +w \ +\Mi g + w \ = 
Ai gl+W . Therefore, because of Lemma [U the term — cAj log A* ,+ w of the wanted bound is 
accounted for. What we have said so far does not hold for J~i g +z (unless it is the rightmost one of 
all the pivotal multisets T*). Let us deal with these cases later. 

Let us now consider any pivotal multiset $\mathcal{F}_o$ created in the first execution of step [2] and containing more than one symbol. Let $V_{z_o}$ be the subproblem corresponding to $\mathcal{F}_o$. Let $r_p, r_{p+1}, \ldots, r_x$ be the ranks corresponding to $\mathcal{F}_o$ and let $N_o$ be the number of symbols in $\mathcal{M}$ less than the ones in $\mathcal{F}_o$. Since $V_{z_o}$ has offset size less than the one of $V_0$, by inductive hypothesis we know that the algorithm retrieves the strings of Z with ranks $r_p, r_{p+1}, \ldots, r_x$ using time

$c|\mathcal{F}_o| \log |\mathcal{F}_o| + c|\mathcal{F}_o| + c5L'_o - c\sum_{j=p}^{x-1} \Delta_j \log \Delta_j - c(r_p - N_o) \log (r_p - N_o) - c(|\mathcal{F}_o| - r_x + N_o + 1) \log (|\mathcal{F}_o| - r_x + N_o + 1)$

where $L'_o + |V_{z_o}|$ is the smallest number of symbols we need to probe to find the wanted strings amongst the ones in $V_{z_o}$ (since $off_{z_o} = 2$, i.e. the first symbol of each string in $V_{z_o}$ can be skipped). Let us call $-c(r_p - N_o) \log (r_p - N_o)$ the left inductive term and the two terms $c|\mathcal{F}_o|$, $c5L'_o$ inductive remainder terms. We will deal with them later.

By Lemma [H we have term $-c(|\mathcal{F}_o| + |\mathcal{M}_o|) \log (|\mathcal{F}_o| + |\mathcal{M}_o|)$ from the first execution of step [2a]. We can upper bound that term by the two-term expression $-c|\mathcal{F}_o| \log |\mathcal{F}_o| - c|\mathcal{M}_o| \log |\mathcal{M}_o|$. The first term $-c|\mathcal{F}_o| \log |\mathcal{F}_o|$ can be cancelled with the opposite term in the expression given by the inductive hypothesis.

For each rank $r_j \in \{r_p, r_{p+1}, \ldots, r_{x-1}\}$ the term $-c\Delta_j \log \Delta_j$ of the wanted upper bound is accounted for, since it is in the expression given by the inductive hypothesis. Amongst the ranks corresponding to $\mathcal{F}_o$, $r_x$ is the one with which we have yet to deal.

Let us first assume that $|\mathcal{F}_{o+1}| = 1$. If that is true, we know that $N^* + |\mathcal{M}_o| = \Delta_x$, where we denote with $N^*$ the number of strings in $V_{z_o}$ with rank (in Z) greater than or equal to $r_x$. But $N^* = |\mathcal{F}_o| - r_x + N_o + 1$. Hence, to obtain the term $-c\Delta_x \log \Delta_x$ for the wanted bound, we need to mix the terms $-c(|\mathcal{F}_o| - r_x + N_o + 1) \log (|\mathcal{F}_o| - r_x + N_o + 1)$ (from the inductive hypothesis on $V_{z_o}$) and $-c|\mathcal{M}_o| \log |\mathcal{M}_o|$ (from the first execution of step [2a]).

It is easy to prove that for any $a, b \ge 1$, $(a + b) \log (a + b) \le a \log a + b \log b + 2(a + b)$. Thus, we have that

$-cN^* \log N^* - c|\mathcal{M}_o| \log |\mathcal{M}_o| \le -c(N^* + |\mathcal{M}_o|) \log (N^* + |\mathcal{M}_o|) + 2c(N^* + |\mathcal{M}_o|) = -c\Delta_x \log \Delta_x + 2c\Delta_x$.

Thus, we have now obtained the $-c\Delta_x \log \Delta_x$ term for the last rank $r_x$ of $\mathcal{F}_o$ for the case $|\mathcal{F}_{o+1}| = 1$. However, we still have to deal with the extra $2c\Delta_x$; let us call it the extra remainder term.
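
For completeness, here is one way to verify the inequality needed in this step, namely $(a+b) \log (a+b) \le a \log a + b \log b + 2(a+b)$. It is a sketch that assumes base-2 logarithms and $a, b \ge 1$, and it relies only on the convexity of $x \log x$.

```latex
\begin{align*}
a \log a + b \log b
  &\ge 2 \cdot \frac{a+b}{2} \log \frac{a+b}{2}
     && \text{(Jensen's inequality, since $x \log x$ is convex)} \\
  &= (a+b) \log (a+b) - (a+b)
     && \text{(since $\log 2 = 1$)} \\
  &\ge (a+b) \log (a+b) - 2(a+b).
\end{align*}
```

Multiplying both sides by $-c$ (with $c > 0$) reverses the inequality and gives $-c\,a \log a - c\,b \log b \le -c(a+b) \log (a+b) + 2c(a+b)$, which is the form in which the bound is applied, with $a = N^*$ and $b$ equal to the size of the adjacent multiset.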

Let us now consider the case $|\mathcal{F}_{o+1}| > 1$. Thus we can use the inductive hypothesis and in this case $\mathcal{F}_{o+1}$ contributes to the bound a left inductive term $-c(r'_p - N'_o) \log (r'_p - N'_o)$ (analogous to the $-c(r_p - N_o) \log (r_p - N_o)$ for $\mathcal{F}_o$), where $r'_p$ is the smallest rank associated with $\mathcal{F}_{o+1}$ and $N'_o$ is the number of symbols in $\mathcal{M}$ less than the ones in $\mathcal{F}_{o+1}$. The term $-c\Delta_x \log \Delta_x$ is obtained in the same way we did before, only this time we have to mix three intervals instead of two, since in this case $\Delta_x = N^* + |\mathcal{M}_o| + (r'_p - N'_o)$. Hence, in this case the extra remainder term is $4c\Delta_x$.

Before we account for all the remainder terms, we still need to deal with the rightmost multiset 
of each maximal consecutive group of those multisets J 7 * such that \J-*\ = 1. The (single) ranks 
r u of each one of these rightmost multisets are the only ones for which we have not yet obtained 
the term — cA M log A u of the wanted upper bound. For any such multiset Fi with rank r[ (i < i'), 
— cA'i log A'- can be obtained in the exact same way we did above and hence we have an extra 
remainder term for each one of these ranks too. 

Finally, let us account for the remainder terms. For each T , we have at most three kinds 
of remainders: cjJ-'ol, c5L' Q and a third kind that is at most 4cA x , where r x is the largest rank 
associated with T a and L' Q + \V Zo \ is the smallest number of symbols we need to probe to find 
the wanted strings amongst the ones in V Zo (if any). Overall, the first and third kinds are upper 
bounded by c5N. By the definition of the second kind of remainder and since all the N symbols 
in M. need to be probed to find the wanted strings, we have that c5N + X^fjlfJM} = c5L'. 
7.2 Proof of Lemma [1]



Let us consider the pruning phase. In the first step, the level-by-level visit of the skip tree of A can be easily done in $O(\|A\|)$ time: the skip tree contains only contact and branching subproblems, hence it has $O(\|A\|)$ nodes; also, by Lemma [5] all the calls to Group have a total cost which is linear in the number of columns of A, which is $\|A\|$. About the second step, the cost of SkipVisit on the skip tree is clearly $O(\|A\|)$. So is the cost of the third step, where we do an O(1) amount of work for each node in Pruned. In the fourth step we do an O(1) amount of work for each subproblem that is not in a subtree rooted at some $V_{w_i} \in$ Pruned. By the definition of Pruned we already know that each one of the subproblems we touch in the fourth step will be later partitioned into two or more smaller subproblems. Thus they are certainly fewer than $\left|\bigcup_i A_i - A\right|$ (which is the total number of new subproblems that are created by the slicing of A). Thus the pruning phase takes $O\left(\|A\| + \left|\bigcup_i A_i - A\right|\right)$ time.

Let us consider the slicing phase. First of all, SliceRec performs a depth-first visit of the tree of A that does not go any deeper whenever it encounters a node in Pruned. Thus, the total number of nodes visited is equal to the number of subproblems that are partitioned into two or more subproblems by the slice operation.

Let us consider the internal nodes of A. For any such $V_r$, in the first step, after the recursive calls to SliceRec have returned, we call Group on a list containing $|C_r| + \sum_i sub_{r_i}$ objects, where $C_r$ is the set of contact suffixes of $V_r$ (if it is a contact node) and $sub_{r_i}$ is the number of subproblems into which the i-th child of $V_r$ has been refined (during the recursive call). We charge each $sub_{r_i}$ term to the corresponding child. The second step costs $O(sub_r)$ time. The third step requires an O(1) amount of work for each one of the children of each new subproblem into which $V_r$ has been partitioned: we charge any such O(1) amount of work to the corresponding child. Thus, the total cost for an internal node $V_r$ of A (including the costs charged to $V_r$ by its parent) is $O(sub_r)$.

For each leaf $V_r$ of A, the total work we do is of the order of the number of columns of $V_r$ (by Lemma [5]). The amount charged to $V_r$ by its parent is of the same order. Thus, overall the cost for all the leaves of A, the internal nodes of A and the new nodes of each $A_i$ is $O\left(\|A\| + \left|\bigcup_i A_i - A\right|\right)$.

Finally let us consider the finishing phase. In the first step we can use radix sorting and thus the first and second steps take $O\left(\left|\bigcup_i A_i - A\right|\right)$ time. The third step accesses each skip node of each new agglomerate O(1) times, thus costing $O(\|A\|)$. The fourth step accesses at most all the nodes that are not descendants of any node in Pruned. By using Ranks, each access to establish if a node is unsolved takes O(1) time. Thus the total cost of the fourth step is $O\left(\left|\bigcup_i A_i - A\right|\right)$.

7.3 Proof of Theorem [3] 

At this point, Theorem [3] follows directly from the following Lemmas [16] and [17], whose proofs detail some of the ideas presented in Section [6].

Lemma 16 The running time of the suffix multi-selection algorithm for a text of length N is upper bounded by $O\left(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N + \mathrm{rap}\right)$, where rap is the total time required by all the RAPs minus the time for the MselMset calls.

Proof: The cost of the initialization stage is dominated by the call to MselMset (building Ranks 
takes 0(N) time). By Lemma [U we know that the cost of that call is within our target bound. 
After the Refine and Aggregate stage, the cost of the finalization stage (Section |5.6|) is O(N). 

Let us now account for the contribution of the calls to MselMset to the total cost. Instead of the normal time cost, let us consider the virtual symbol cost of the calls to MselMset: a call to MselMset costs or (conceptually) creates x virtual symbols if the multiset that it receives in input has x objects. The virtual symbols created by the call to MselMset for some leading subproblem are exclusively created and "used" for that subproblem. Thus, unlike T's symbols, virtual symbols are not shared by subproblems. Naturally, virtual symbols form virtual strings: each subproblem $V_i$ has associated $|V_i|$ virtual strings that are made up of all the virtual symbols that will be created during the computation to refine $V_i$ every time it is the leading subproblem of its current agglomerate. And, unlike the suffixes of T, all the virtual strings are independent.

Picking an unsolved agglomerate A and refining it with a RAP can be seen as a two-part process: (a) picking an unsolved subproblem, the leading subproblem of A, and refining it with the call to MselMset; (b) refining all the other unsolved subproblems of A with the rest of the RAP (mainly the call to Slice). Thus, if we set aside all the (b) parts of the RAPs and if we consider the subproblems to be subsets of virtual strings, then the suffix multi-selection algorithm behaves exactly like the multi-selection algorithm for independent strings in Section [7.1]. Therefore, by Lemma [15] we have that the cost of the algorithm is $O\left(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N + L + \mathrm{rap}\right)$, where $r_0 = 0$, $r_{K+1} = N + 1$, $\Delta_j = r_{j+1} - r_j$ for $0 \le j \le K$, L is the total number of virtual symbols and rap is the total cost of the (b) parts of the RAPs.

Let us evaluate L. As we have seen above, a RAP on an agglomerate A creates a number of new virtual symbols that is equal to the cardinality of the multiset passed to MselMset (equal to $\|A\|$). Since it is not computed during the MselMset call, the cardinality of such multiset must be O(rap(A)) (where rap(A) denotes the total cost of the RAP on A minus the cost of the MselMset call). Therefore L = O(rap). ∎

Lemma 17 rap = O(N). 

Proof: Let us first evaluate the total cost of the steps of a RAP on an agglomerate A minus the
cost of the call to MselMset in the third step.

About the first step, to verify which kind of agglomerate we are dealing with (generic or core
cyclic), O(1) scans of the contact node list of A (while using the Suff structure) are enough. Thus,
the first step takes $O(\|A\|)$ time.

About the second step, the slightly more complex case is when A is core cyclic. In that case
we first find the columns (cj, rj) such that T Cj ^i £ 7-j A, then their keys, and from those we
produce the keys for all the other columns. As we noticed, we do not actually access each suffix
of A to assign it its key. We do that only with A's columns. Thus, the second step takes $O(\|A\|)$
time.

About the third step. Thanks to Ranks, retrieving the $|\mathcal{R}(V_w)|$ ranks of the leading subproblem
$V_w$ of A takes $O(|\mathcal{R}(V_w)|)$ time. As we said, there is a 1-to-1 correspondence between the leading
suffixes and the root suffixes. Hence, retrieving the suffixes of $V_w$ and their keys takes $O(|V_w|)$ time. We
will deal with the cost of the call to MselMset later. Scanning the pivotal multisets and tagging
the columns after the call to MselMset clearly costs $O(\|A\|)$. If we exclude MselMset, the
cost of the third step is $O(\|A\|)$ (since $|V_w| = \|A\|$).

The cost of the fourth step is dominated by the cost of the call Slice(A) which, by Lemma 1,
is $O(\|A\| + |\bigcup_{i=1}^{q} A_i - A|)$, where $A_1, \ldots, A_q$ are the agglomerates into which A is sliced.

Let us consider the fifth step and, in particular, its three substeps described in Section 5.2. In
the first substep, we call Lead Visit on the root of each agglomerate in undecided_g. Lead Visit
accesses the nodes of the skip tree of the agglomerate it received in input and for each node
does O(1) time worth of work. In the second substep we compute the prefix sum array SConj of
Conj for each $A_i \in$ undecided_g and then we tag each column of $A_i$. So both substeps require
$O(\sum_{A_i \in \mathrm{undecided\_g}} \|A_i\|)$ time. In the third substep we call Slice($A_i$) for each $A_i \in$ undecided_g.
All those calls clearly dominate the total cost of the substep. For each $A_i \in$ undecided_g, let
$A_{i_1}, \ldots, A_{i_{t(i)+1}}$ be the $t(i) + 1$ agglomerates into which $A_i$ is sliced (for simplicity's sake let
us assume that one of them is an exhausted agglomerate, for each $A_i$). By Lemma 1 the
total cost of the third substep is $O(\sum_{A_i \in \mathrm{undecided\_g}} (\|A_i\| + |\bigcup_{j=1}^{t(i)+1} A_{i_j} - A_i|))$.

Finally, in the sixth step we operate on each $A_i \in$ joinable_g. We have two cases, in which
we do either a Join or a SliceJoin between $A_i$ and the agglomerate $A_{i^*}$ with which $A_i$ is joinable.
In the second case, let $A_{i^*_1}$ be the agglomerate "sliceable" from $A_{i^*}$ such that $(c_j, r_j)$ is
a column of $A_{i^*_1}$ iff $T_{c_j-1} \in V_z \in A_i$. By Lemmas 2 and 3 the total cost of the sixth step is
$O(\sum_{A_i \in \mathrm{joinable\_g}} \|A_i\|)$ in the first case and $O(\sum_{A_i \in \mathrm{joinable\_g}} (\|A_i\| + |A_{i^*_1}|))$ in the second. Let
us denote with slicejoinable_g all the $A_i$'s in joinable_g that need a SliceJoin in the eighth
step.

Excluding the cost of the call to MselMset in the third step, and since
$\sum_{A_i \in \mathrm{undecided\_g}} \|A_i\| + \sum_{A_i \in \mathrm{joinable\_g}} \|A_i\| \le \|A\|$ (in the fourth step some of the $A_i$'s sliced from A may have been moved
to unsolved_g), the total time for the RAP is

$O(\|A\| + \Gamma + \Lambda + \Phi)$,

where $\Gamma$, $\Lambda$ and $\Phi$ are the numbers of subproblems created in the fourth step, in the fifth step and by the
SliceJoin operations, respectively.

To complete the analysis of rap let us introduce the events. During the algorithm five kinds
of crucial events happen. (a) Column fusions: when during a Join($A'$, $A$) (or at the end of a
SliceJoin) a column $(c_i, r_i)$ of $A'$ is fused with the column $(c_j, r_j)$ of $A$ such that $r_i + 1 = c_j$,
to form the column $(c_i, r_j)$. (b) Subproblem creations: when during a Slice (or at the beginning of
a SliceJoin) some subproblem $V_i$ is partitioned into smaller subproblems $V_{i_1}, \ldots, V_{i_p}$, we have p
subproblem creation events. (c) Suffix exhaustions: when during the third step of a RAP for A, after
the MselMset call, some suffix $T_e \in V_w \in A$, where $V_w$ is the leading subproblem of A, becomes
exhausted (i.e. $T_e$ belongs to a pivotal multiset $\mathcal{M}_i$, thus we know for sure that it is not one of the
wanted suffixes). (d) Suffix discoveries: when during the third step of a RAP for A, some suffix $T_w$
in $V_w$, the leading subproblem of A, is recognized as one of the wanted suffixes (i.e. $T_w$ belongs
to a pivotal multiset $\mathcal{F}_i$ such that $|\mathcal{F}_i| = 1$). (e) Inner collisions: when during the second step of
a RAP for A, we encounter a column $(c_i, r_i)$ of A such that $(c_j, r_j)$, where $c_j = r_i + 1$, is also in A
while $T_{c_i}$ and $T_{c_j}$ do not belong to the same subproblem.

A column is never divided into smaller ones and a subproblem is never merged with others to
form a larger subproblem. Also, when a suffix becomes exhausted or a wanted suffix is discovered,
it can never be in an unsolved subproblem again. Finally, the same inner collision cannot be
repeated after the RAP in which it has been detected has ended, since the two colliding columns
will not be part of the same agglomerate any longer. By all the above, it is easy to see that no
particular event can occur twice. Since there are N suffixes and we start with N columns
and one subproblem, we can conclude that the total number of events during the computation is
O(N), or $\le 6N$ to be precise.

Let us establish how many events take place during a generic RAP for an agglomerate A. Let
us start with subproblem creations. In the fourth step, for each agglomerate $A_i$ sliced from A, we
have $|A_i - A|$ subproblem creations, one for each new (i.e. not coming from A) subproblem in $A_i$.
Analogously, in the third substep of the fifth step (Section 5.2), for each $A_i \in$ undecided_g and
for each $A_{i_j}$ sliced from $A_i$, we have $|A_{i_j} - A_i|$ subproblem creations. Finally, in the eighth step,
for each $A_i \in$ slicejoinable_g, we have $|A_{i^*_1}|$ subproblem creations. Summing up, $\Gamma + \Lambda + \Phi$
subproblem creations take place during the RAP for an agglomerate A. The total number of suffix
discoveries and suffix exhaustions occurring during the RAP on A is equal to the total number
of columns of the agglomerates that in the fourth step are moved either to exhausted_g or to
undecided_g. Each column of each agglomerate moved to undecided_g corresponds to a suffix
exhaustion. Each column of each agglomerate moved to exhausted_g corresponds to either a suffix
discovery or a suffix exhaustion. Analogously, the numbers of column fusions and inner collisions
during the RAP of A are equal to the total numbers of columns of the $A_i$'s that in the fourth step
are moved to joinable_g and unsolved_g, respectively. Since $\sum_{i=1}^{q} \|A_i\| = \|A\|$, we can conclude
that the total number of events occurring during the RAP for A is $\|A\| + \Gamma + \Lambda + \Phi$. As we have seen,
if we exclude the calls to MselMset, the cost of the RAP for A is $O(\|A\| + \Gamma + \Lambda + \Phi)$. Therefore
the total time required by all the RAPs minus the time for the MselMset calls is of the order of
the total number of events, which is O(N). □

7.4 Proof of Theorem 1

Theorem 1 follows directly from Lemmas 18 and 19.

Lemma 18 Given a text T of N symbols drawn from an unbounded alphabet and K consecutive
ranks $r_1, \ldots, r_K$ (i.e. $r_2 = r_1 + 1$, $r_3 = r_2 + 1$, ..., $r_K = r_{K-1} + 1$), the K text suffixes of ranks
$r_1, \ldots, r_K$ can be found using $O(K \log K + N)$ time and comparisons.
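As a point of reference for what Lemma 18 computes, the following brute-force Python sketch returns the same output by sorting all the suffixes and slicing out the wanted ranks. It is not the algorithm of the lemma: it is the "total problem plus post-processing" baseline that the lemma improves upon, and it is far slower than $O(K \log K + N)$. The function name and its 1-based conventions are assumptions made for illustration.

```python
def suffixes_of_consecutive_ranks(text, first_rank, k):
    """Brute-force baseline: starting positions (1-based) of the suffixes
    of ranks first_rank, ..., first_rank + k - 1 in lexicographic order.

    Sorts every suffix explicitly, so it only specifies the expected
    output; it does not reflect the bound of Lemma 18.
    """
    n = len(text)
    if first_rank < 1 or k < 1 or first_rank + k - 1 > n:
        raise ValueError("ranks must lie in [1, N]")
    # Sort suffix start positions by comparing the suffixes themselves.
    order = sorted(range(n), key=lambda i: text[i:])
    return [i + 1 for i in order[first_rank - 1:first_rank - 1 + k]]
```

For instance, for the text "banana" the sorted suffixes start at positions 6, 4, 2, 1, 5, 3, so ranks 2 to 4 correspond to positions 4, 2 and 1.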

Proof: To retrieve the wanted suffixes in $O(K \log K + N)$ time we first apply the suffix multi-selection
algorithm on T with only $r_1$ and $r_K$. Then we go through an intermediate stage that
takes the subproblems left by the suffix multi-selection and prepares them for a second suffix
multi-selection. After that we apply the suffix multi-selection on T again, but this time (a) we use all
the ranks $r_1, \ldots, r_K$ and (b) we skip the initialization stage (because of the work done in the
intermediate stage).

Let us give the details of the process. 

We execute the suffix multi-selection algorithm on T with $\mathcal{R} = \{r_1, r_K\}$. After the computation
ends, we have the two wanted suffixes, let them be $T_l$ and $T_r$ ($T_l < T_r$), the exhausted agglomerates
and a subproblem list SubList = $(V_0, \ldots, V_{i_l}, \ldots, V_{i_r}, \ldots, V_p)$, ($i_l < i_r < p$), with the following
properties. (a) $V_{i_l} = \{T_l\}$ and $V_{i_r} = \{T_r\}$ (they are the only solved subproblems). (b) $V_i < V_{i_l}$,
$V_{i_l} < V_{i'} < V_{i_r}$ and $V_{i_r} < V_{i''}$, for each $i < i_l$, $i_l < i' < i_r$ and $i'' > i_r$, respectively. (c) All together
the subproblems $V_{i'}$ with $i_l < i' < i_r$ contain exactly $K - 2$ suffixes and they are the ones with
ranks $r_2, \ldots, r_{K-1}$ (but for each of those suffixes we do not know which of $r_2, \ldots, r_{K-1}$ is its rank).

The intermediate stage has the following steps. 

First, for each subproblem in SubList we retrieve its suffixes. Recall that for each agglomerate
$A_i$ only contact and root suffixes are explicitly stored during the computation. To retrieve all the
suffixes of each $V_j \in A_i$ we simply need to visit the tree of $A_i$ from its root with Suffix Visit($V_r$)
defined as follows. (i) If $V_r$ is a leaf then $V_r = \{T_c \mid (c, r)$ is in $V_r$'s column list$\}$. After we
retrieved the suffixes of $V_r$ we return the set $\{T_{j+1} \mid T_j \in V_r\}$. (ii) Otherwise, if $V_r$ is a contact
node (but not a leaf) we retrieve $Con = \{T_c \mid (c, r)$ is in $V_r$'s column list$\}$. (iii) In any case, we
call Suffix Visit($V_{r_i}$) for each child $V_{r_i}$ of $V_r$; let S be the set of all the suffixes we receive from
all these recursive calls. (iv) Then, we set $V_r = Con \cup S$ and we return the set $\{T_{j+1} \mid T_j \in V_r\}$.
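The recursion of Suffix Visit can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, its field names, and the representation of a suffix $T_j$ by its starting position j are hypothetical and are not the data structures used in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical tree node for one subproblem of an agglomerate."""
    columns: list = field(default_factory=list)   # column list: (c, r) pairs
    children: list = field(default_factory=list)
    is_contact: bool = False
    suffixes: set = field(default_factory=set)    # filled in by suffix_visit

def suffix_visit(node):
    """Fill node.suffixes for the whole subtree; return {j + 1 : j in node}."""
    if not node.children:
        # (i) leaf: its suffixes are the first positions of its columns.
        node.suffixes = {c for c, _ in node.columns}
        return {j + 1 for j in node.suffixes}
    # (ii) an internal contact node contributes its own column heads.
    con = {c for c, _ in node.columns} if node.is_contact else set()
    # (iii) collect the suffixes handed up by the children.
    received = set()
    for child in node.children:
        received |= suffix_visit(child)
    # (iv) the node's suffixes are the union; pass the shifted set upward.
    node.suffixes = con | received
    return {j + 1 for j in node.suffixes}
```

Each node is visited once and the work per node is proportional to the number of suffixes it ends up holding, which is in line with the $O(suf_i)$ cost claimed for this step.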

Second, with a scan of SubList, we do the following. We merge $V_0$ with all the subproblems in its
neighborhood $\mathcal{N}_0$, let them be $V_1, V_2, \ldots, V_{n_0}$, into one (the subproblems in the same neighborhood
are adjacent in SubList). We do the same for $V_{n_0+1}$ and its neighborhood $V_{n_0+2}, \ldots, V_{n_1}$, then
for $V_{n_1+1}$ and so forth until all the subproblems $V_i < V_{i_l}$ have been treated. After that we do the
same for all the subproblems $V_{i''} > V_{i_r}$. Finally we merge all the subproblems $V_{i'}$ with $i_l < i' < i_r$
into one, let it be $V_\#$. After these merges, the new SubList = $(V_0, \ldots, V_{i_l}, V_\#, V_{i_r}, \ldots, V_{p'})$
maintains the same properties of the original one. However, the meaning of the integer labels of
the subproblems may have changed. Now for each subproblem $V_i \in$ SubList, i is the number of
suffixes of T that are lexicographically smaller than each $T_j \in V_i$ (because now for each $V_i$ we have
that $\mathcal{N}_i = \{V_i\}$, whereas originally that was guaranteed only for $V_{i_l}$ and $V_{i_r}$).
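The neighborhood merge and the relabelling can be sketched in a few lines of Python. The input format is an assumption made for illustration: SubList is modelled as a list of `(neighborhood_id, suffixes)` pairs in lexicographic order, covering all the suffixes of T, with the subproblems of one neighborhood adjacent to each other. The merge of the middle subproblems into $V_\#$ can be modelled by giving them a common id.

```python
from itertools import groupby

def merge_neighborhoods(sublist):
    """Merge adjacent subproblems sharing a neighborhood id.

    Returns (label, suffixes) pairs, where label is the number of suffixes
    in all the preceding merged subproblems, i.e. the number of suffixes
    lexicographically smaller than those of the current one.
    """
    merged = []
    smaller = 0                       # suffixes seen in earlier groups
    for _, group in groupby(sublist, key=lambda item: item[0]):
        suffixes = [s for _, block in group for s in block]
        merged.append((smaller, suffixes))
        smaller += len(suffixes)
    return merged
```

A single left-to-right scan suffices, which matches the O(N) cost attributed to this step.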

Third, for each $T_i \in V_\#$, let $T_i$'s key be its first symbol $T[i]$. We call MselMset($V_\#$, $r_2, \ldots, r_{K-1}$)
and we obtain $V_\#$'s pivotal subsets $\mathcal{M}_0, \mathcal{F}_1, \mathcal{M}_1, \ldots, \mathcal{F}_t, \mathcal{M}_t$. Since we know that $V_\#$ contains all
and only the $K - 2$ suffixes with ranks $r_2, \ldots, r_{K-1}$, all the $\mathcal{M}_i$ are void. Thus, from each $\mathcal{F}_i$
we create a subproblem $V_{\#i}$ with the following properties: (a) $|V_{\#i}| = |\mathcal{R}(V_{\#i})|$ and (b) $\#i$ is
the number of suffixes of T smaller than each one in $V_{\#i}$. After this step we have SubList =
$(V_0, \ldots, V_{i_l}, V_{\#1}, \ldots, V_{\#t}, V_{i_r}, \ldots, V_{p'})$.

Fourth, from each $V_i$ with $i < i_l$ or $i > i_r$ (they are all exhausted) we make an agglomerate
$A_i$ and we move it to exhausted_g. We do the same for $V_{i_l}$ and $V_{i_r}$ (although, as subproblems,
they are solved ones). Finally, from each $V_{\#i}$ we make an agglomerate $A_{\#i}$ and we move it to
unsolved_g.

After the intermediate stage, we apply again the suffix multi-selection algorithm on T with
the full rank set $\mathcal{R} = \{r_1, \ldots, r_K\}$ in the following way. We skip the initialization stage
completely and we start immediately with the refine and aggregate stage (using SubList, unsolved_g,
exhausted_g, the agglomerates and the subproblems we already have after the intermediate stage).
The rest of the computation proceeds normally.

Let us now evaluate the cost of the whole computation. The first call of the suffix multi-selection
algorithm is done with just two ranks, $r_1$ and $r_K$. Thus, by Theorem 3, its cost is O(N).
Let us consider the intermediate stage. The first step requires O(N) time: for each agglomerate
$A_i$, Suffix Visit costs $O(suf_i)$ where $suf_i = \sum_{V_j \in A_i} |V_j|$. For the second step a scan of SubList
is enough and the cost is O(N). The cost of the third step is dominated by the cost of the call to
MselMset. Since it is done on a multiset of $K - 2$ elements (and with a set of $K - 2$ ranks), its
cost is clearly $O(K \log K)$. Finally, an O(N) time scan of SubList is enough for the fourth step.

Let us now consider the cost of the second execution of the suffix multi-selection. The complexity
proof is the same as the one for Theorem 3 except for two aspects. First, there is no initialization
stage, thus the cost of finding the pivotal multisets of $\{T[1], \ldots, T[N]\}$ disappears. Second, the total
contribution of the MselMset calls made during the whole refine and aggregate stage changes as
follows. The total number of suffixes of all the leading subproblems is at most $K - 2$. Thus the
total number of virtual strings is at most $K - 2$. On the other hand, the events that take place
during all the RAPs are still O(N). Thus, the total number of virtual symbols in the virtual strings
is O(N). All the other additional costs of the RAPs remain the same. Therefore, by Lemma 15, we
have that the total cost of the second call of the suffix multi-selection is $O(K \log K + N)$. □

For any two subproblems $V_i$ and $V_{i'}$ such that $\mathcal{N}_i \ne \mathcal{N}_{i'}$, let lcp($V_i$, $V_{i'}$) be the length of the
longest common prefix of any two suffixes $T_j \in V_i$ and $T_{j'} \in V_{i'}$. Let us extend the notation to
neighborhoods: for any two $\mathcal{N}_i \ne \mathcal{N}_{i'}$, lcp($\mathcal{N}_i$, $\mathcal{N}_{i'}$) is equal to lcp($V_i$, $V_{i'}$). We have the following:

Lemma 19 The suffix multi-selection algorithm can be modified (without changing its asymptotic
time complexity) so that lcp($V_i$, $V_{i'}$) can be computed in O(1) time, for any two $V_i$, $V_{i'}$ such that
$\mathcal{N}_i \ne \mathcal{N}_{i'}$.

Proof: As we have seen, for any subproblem $V_i$, the subproblems in its neighborhood $\mathcal{N}_i$ are in
contiguous positions in SubList. We maintain another list NeighList whose elements represent the
neighborhoods: the i-th element in NeighList represents the i-th contiguous group of subproblems
in SubList forming a neighborhood. Each $V_i$ in SubList has a link to the element in NeighList
corresponding to its neighborhood. We also maintain a third list LcpList such that if $V_i$ and $V_{i'}$ are
in two adjacent neighborhoods NeighList[j] and NeighList[j + 1] then LcpList[j] = lcp($V_i$, $V_{i'}$).
A cartesian tree is maintained on LcpList and a structure for dynamic LCA queries (e.g. [5]) is
maintained on the cartesian tree.

SubList, NeighList and LcpList are updated in the finishing phase of the Slice operation
(see Section 5.3.3). As we have seen, a subproblem $V_i$ in SubList is replaced by a partitioning
$V_{i_1}, \ldots, V_{i_r}$ of it. If $V_i$ was active then the partitioning is necessarily a refining one, otherwise we
do not care. Thus, if $V_i$ was inactive, the set of the suffixes whose subproblems belong to $\mathcal{N}_i$ does
not change and neither NeighList nor LcpList needs to be updated.

On the other hand, if $V_i$ was active then $\mathcal{N}_i = \{V_i\}$ and $\mathcal{N}_{i_j} = \{V_{i_j}\}$, for each $1 \le j \le r$. Thus,
the element in NeighList for $\mathcal{N}_i$ is replaced by the ones for $\mathcal{N}_{i_1}, \ldots, \mathcal{N}_{i_r}$. Since the partitioning
is a refining one, we have that lcp($\mathcal{N}_p$, $\mathcal{N}_{i_1}$) = lcp($\mathcal{N}_p$, $\mathcal{N}_i$) and lcp($\mathcal{N}_{i_r}$, $\mathcal{N}_s$) = lcp($\mathcal{N}_i$, $\mathcal{N}_s$),
where $\mathcal{N}_p$ and $\mathcal{N}_s$ are (were) the predecessor and successor of the neighborhood of $V_i$ in NeighList. Thus the
corresponding entries in LcpList do not need to be updated (and neither does the cartesian tree
nor the structure for LCA queries). The values lcp($\mathcal{N}_{i_1}$, $\mathcal{N}_{i_2}$), ..., lcp($\mathcal{N}_{i_{r-1}}$, $\mathcal{N}_{i_r}$) can be easily
found by doing $r - 1$ LCA queries. Since $V_i$ has been refined, we know that none of the values
lcp($\mathcal{N}_{i_1}$, $\mathcal{N}_{i_2}$), ..., lcp($\mathcal{N}_{i_{r-1}}$, $\mathcal{N}_{i_r}$) can be smaller than either lcp($\mathcal{N}_p$, $\mathcal{N}_i$) or lcp($\mathcal{N}_i$, $\mathcal{N}_s$). Thus each
pair of insertions of $\mathcal{N}_{i_j}$ in NeighList and of lcp($\mathcal{N}_{i_j}$, $\mathcal{N}_{i_{j+1}}$) in LcpList corresponds to the
insertion of a leaf on the cartesian tree built on LcpList. This kind of update of the structure for LCA
queries can be done in O(1) time (hence adding an extra O(N) term to the complexity bound for
the suffix multi-selection algorithm). □
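The query side of this construction relies on a standard fact: for lexicographically ordered groups, the lcp of two non-adjacent neighborhoods is the minimum of the LcpList entries lying between them. The Python sketch below illustrates only that query. It deliberately swaps the dynamic cartesian tree with LCA structure of the proof for a static sparse table, which also answers range-minimum queries in O(1) time but does not support the leaf insertions discussed above. The function names are hypothetical.

```python
def build_sparse_table(lcp_list):
    """table[k][i] = min(lcp_list[i : i + 2**k]); O(n log n) preprocessing."""
    n = len(lcp_list)
    table = [list(lcp_list)]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[i], prev[i + span])
                      for i in range(n - 2 * span + 1)])
        span *= 2
    return table

def neighborhood_lcp(table, j, k):
    """lcp between neighborhoods j and k (0-based positions in NeighList).

    Equals min(LcpList[j : k]) for j < k, answered with two table lookups.
    """
    if j == k:
        raise ValueError("the two neighborhoods must be distinct")
    if j > k:
        j, k = k, j
    level = (k - j).bit_length() - 1          # floor(log2(k - j))
    return min(table[level][j], table[level][k - (1 << level)])
```

A dynamic structure is needed in the actual algorithm because NeighList and LcpList grow during the Slice operations; the sparse table is used here only because it is short and easy to check.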

8 Conclusions 

We studied partial compression and text indexing problems and, as a technical piece, the suffix
multi-selection problem. The main theme is that when comparing an arbitrary set of suffixes which
might overlap in sophisticated ways, we need to devise methods to avoid rescanning characters to
get optimal results. We achieve this with a variety of structural observations and carefully arranged
computations, and obtain bounds that are optimal with respect to those known for atomic elements. Other
partial suffix problems will be of great interest in Stringology and its applications.

Figure 2: "Snapshot" of the computation for a text T with N = 181 symbols and K = 34 ranks in
$\mathcal{R}$. Agglomerates $A_1$, $A_2$ and $A_3$ are unsolved, while $A_0$, $A_4$ and $A_5$ are exhausted. The columns
of each agglomerate are pictured beside it (as contiguous substrings of T). T is pictured as a
partitioning of the columns of the agglomerates.